[Bug target/114860] [14/15 regression] [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 since r14-10014-ga2f4be3dae04fa
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114860

--- Comment #9 from Tamar Christina ---
(In reply to prathamesh3492 from comment #8)
> Hi Tamar,
> Using -falign-loops=5 indeed brings back the performance.
> The adrp instruction has the same address (0x4ae784) by setting -falign-loops=5
> (which reduces misalignment to 4) with/without a2f4be3dae0. So I guess this
> is really a code-alignment issue?

Indeed, we don't aggressively align loops when they would require too much
padding, so as not to bloat the binary.  That's why sometimes you just get
unlucky and the hot loop ends up misaligned.
[Bug tree-optimization/115130] (early-break) [meta-bug] early break vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115130 Tamar Christina changed: What|Removed |Added Ever confirmed|0 |1 Last reconfirmed||2024-05-17 Status|UNCONFIRMED |NEW
[Bug tree-optimization/115130] New: (early-break) [meta-bug] early break vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115130 Bug ID: 115130 Summary: (early-break) [meta-bug] early break vectorization Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: meta-bug, missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: tnfchris at gcc dot gnu.org Blocks: 53947 Target Milestone: --- Meta tickets about early break vectorization to better keep track of early break specific issues Referenced Bugs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947 [Bug 53947] [meta-bug] vectorizer missed-optimizations
[Bug tree-optimization/115120] Bad interaction between ivcanon and early break vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115120

--- Comment #3 from Tamar Christina ---
That makes sense, though I also wonder how it works for scalar multi-exit loops; IVopts has various checks on single exits.  I guess one problem is that the code in IVopts that does this uses the exit to determine niters.  But in the case of the multiple-exit vector code the vectorizer could have picked a different exit.  So I guess the question is: how do we even tell which one is used, or could the transformation be driven from the PHI nodes themselves instead of an exit?
[Bug target/114860] [14/15 regression] [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 since r14-10014-ga2f4be3dae04fa
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114860

--- Comment #7 from Tamar Christina ---
Yeah, it's most likely an alignment issue, especially as there are no code changes.  We run our benchmarking with different flags, which may be why we don't see it.  The loop seems misaligned; you can try increasing the alignment guarantee to check, e.g. -falign-loops=5.  But ultimately, I think it's just bad luck.  We don't align loops and labels if they would require too many padding instructions.
[Bug target/114412] [14/15 Regression] 7% slowdown of 436.cactusADM on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114412

--- Comment #5 from Tamar Christina ---
(In reply to Filip Kastl from comment #4)
> (In reply to Tamar Christina from comment #3)
> > Hi Filip,
> >
> > Do you generate these runs with counters based PGO or compiler
> > instrumentation?
> >
> > Just so I know before I start trying to reproduce them.
>
> Hi Tamar,
>
> By counters you mean some sort of hardware counters? I didn't know there
> were multiple ways to do PGO with GCC.
>
> I think that the answer to your question is "compiler instrumentation". I
> just do -fprofile-generate, run the instrumented binary and then
> -fprofile-use.

Yeah, with some elbow grease the perf record method works too, but it's not
very accurate on Armv8.  I'll try to reproduce and bisect these over the
weekend!
[Bug target/115087] New: dead block not eliminated in SVE intrinsics code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115087

Bug ID: 115087 Summary: dead block not eliminated in SVE intrinsics code Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: tnfchris at gcc dot gnu.org Target Milestone: ---

The testcase in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114151 had another
"regression" in that the same loop seems to have been peeled, but only for 1
iteration.  However the loops are identical, so the peeling is weird.

This was caused by:

f5fb9ff2396fd41fdd2e6d35a412e936d2d42f75 is the first bad commit
commit f5fb9ff2396fd41fdd2e6d35a412e936d2d42f75
Author: Jan Hubicka
Date:   Fri Jul 28 16:18:32 2023 +0200

    loop-split improvements, part 3

    extend tree-ssa-loop-split to understand test of the form
    if (i==0) and if (i!=0) which triggers only during the first iteration.
    Naturally we should also be able to trigger last iteration or split
    into 3 cases if the test indeed can fire in the middle of the loop.

however the commit is innocent; it looks like we're not below a magic threshold
that causes the issue.  However a simpler testcase:

#include <arm_sve.h>

void test(int size, char uplo, float16_t *p_mat) {
  int col_stride = uplo == 'u' ? 1 : size;
  auto *a = &p_mat[0];
  auto pg = svptrue_b16();
  for (int j = 0; j < size; ++j) {
    auto *a_j = &a[j];
    if (j > 0) {
      int col_i = j + 1;
      auto v_a_ji_0 = svld1_vnum_f16(pg, (const float16_t *)&a_j[col_i], 0);
      v_a_ji_0 = svcmla_f16_x(pg, v_a_ji_0, v_a_ji_0, v_a_ji_0, 180);
    }

    int col_i = j * col_stride;
    auto v_a_ji_0 = svld1_vnum_f16(pg, (const float16_t *)&a_j[col_i], 0);
    auto v_old_a_jj_0 = svld1_vnum_f16(pg, (const float16_t *)&a_j[j], 0);
    v_a_ji_0 = svmul_f16_x(pg, v_old_a_jj_0, v_a_ji_0);
    svst1_vnum_f16(pg, (float16_t *)&a_j[col_i], 0, v_a_ji_0);
  }
}

shows that the change in the patch is a positive one.

The issue seems to be that GCC does not see the if block as dead code:

    if (j > 0) {
      int col_i = j + 1;
      auto v_a_ji_0 = svld1_vnum_f16(pg, (const float16_t *)&a_j[col_i], 0);
      v_a_ji_0 = svcmla_f16_x(pg, v_a_ji_0, v_a_ji_0, v_a_ji_0, 180);
    }

is dead because v_a_ji_0 is overwritten before use.

  _29 = MEM <__SVFloat16_t> [(__fp16 *)_88 + ivtmp.10_52 * 2];
  svcmla_f16_x ({ -1, 0, ... }, _29, _29, _29, 180);

_29 is dead, but I guess it's not eliminated because it doesn't know what
svcmla_f16_x does.  But are these intrinsics not marked as CONST|PURE?
We finally eliminate it at the RTL level, but I think we should mark these
intrinsics as ECF_CONST.
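As a side note, the effect of the CONST|PURE classification on DCE can be seen
with plain function attributes; a minimal sketch (f_const and f_unknown are
made-up declarations, not GCC internals):

  /* A call whose result is unused can only be deleted when the callee is
     known to have no side effects, e.g. declared __attribute__((const)),
     which is what ECF_CONST expresses internally.  */
  __attribute__((const)) int f_const (int x);
  int f_unknown (int x);

  void g (int x)
  {
    f_const (x);    /* dead: result unused and callee has no side effects */
    f_unknown (x);  /* must be kept: the callee might have side effects   */
  }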
[Bug target/114412] [14/15 Regression] 7% slowdown of 436.cactusADM on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114412 Tamar Christina changed: What|Removed |Added CC||tnfchris at gcc dot gnu.org --- Comment #3 from Tamar Christina --- Hi Filip, Do you generate these runs with counters based PGO or compiler instrumentation? Just so I know before I start trying to reproduce them.
[Bug tree-optimization/114932] Improvement in CHREC can give large performance gains
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

Tamar Christina changed:
           What    |Removed                       |Added
    Ever confirmed |0                             |1
            Status |UNCONFIRMED                   |ASSIGNED
  Last reconfirmed |                              |2024-05-13
          Assignee |unassigned at gcc dot gnu.org |tnfchris at gcc dot gnu.org

--- Comment #8 from Tamar Christina ---
(In reply to Richard Biener from comment #7)
> Likely
>
>   Base: (integer(kind=4) *) + ((sizetype) ((unsigned long) l0_19(D) *
> 324) + 36)
>
> vs.
>
>   Base: (integer(kind=4) *) + ((sizetype) ((integer(kind=8)) l0_19(D)
> * 81) + 9) * 4
>
> where we fail to optimize the outer multiply.  It's
>
>   ((unsigned)((signed)x * 81) + 9) * 4
>
> and likely done by extract_muldiv for the case of (unsigned)x.  The trick
> would be to promote the inner multiply to unsigned to make the otherwise
> profitable transform valid.  But best not by enhancing extract_muldiv ...

Ah, merci!  Mine then.
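For reference, a small standalone check of why the promoted form is equivalent
as long as the inner signed multiply does not overflow (my own sketch, not part
of the PR):

  #include <assert.h>

  /* ((unsigned)((signed)x * 81) + 9) * 4 equals (unsigned)x * 324 + 36
     modulo 2^64 whenever the signed multiply itself does not overflow.  */
  int main (void)
  {
    for (long x = -1000; x <= 1000; x++)
      {
        unsigned long a = ((unsigned long) (x * 81) + 9) * 4;
        unsigned long b = (unsigned long) x * 324 + 36;
        assert (a == b);
      }
    return 0;
  }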
[Bug tree-optimization/114932] Improvement in CHREC can give large performance gains
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #6 from Tamar Christina ---
Created attachment 58096
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58096&action=edit
exchange2.fppized-bad.f90.187t.ivopts
[Bug tree-optimization/114932] Improvement in CHREC can give large performance gains
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #5 from Tamar Christina ---
Created attachment 58095
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58095&action=edit
exchange2.fppized-good.f90.187t.ivopts
[Bug tree-optimization/114932] Improvement in CHREC can give large performance gains
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #4 from Tamar Christina ---
reduced more:

---
module brute_force
  integer, parameter :: r=9
  integer block(r, r, 0)
contains
subroutine brute
  do
  do
  do
  do
  do
  do
  do i7 = l0, 1
     select case(1 )
     case(1)
        block(:2, 7:, 1) = block(:2, 7:, i7) - 1
     end select
     do i8 = 1, 1
        do i9 = 1, 1
           if(1 == 1) then
              call digits_20
           end if
        end do
     end do
  end do
  end do
  end do
  end do
  end do
  end do
  end do
end
end
---

I'll have to stop now till I'm back, but the main difference seems to be in:

good:

  :
  IV struct:
    SSA_NAME:   _1
    Type:       integer(kind=8)
    Base:       (integer(kind=8)) ((unsigned long) l0_19(D) * 81)
    Step:       81
    Biv:        N
    Overflowness wrto loop niter:       Overflow
  IV struct:
    SSA_NAME:   _20
    Type:       integer(kind=8)
    Base:       (integer(kind=8)) l0_19(D)
    Step:       1
    Biv:        N
    Overflowness wrto loop niter:       No-overflow
  IV struct:
    SSA_NAME:   i7_28
    Type:       integer(kind=4)
    Base:       l0_19(D) + 1
    Step:       1
    Biv:        Y
    Overflowness wrto loop niter:       No-overflow
  IV struct:
    SSA_NAME:   vectp.22_46
    Type:       integer(kind=4) *
    Base:       (integer(kind=4) *) + ((sizetype) ((unsigned long) l0_19(D) * 324) + 36)
    Step:       324
    Object:     (void *)
    Biv:        N
    Overflowness wrto loop niter:       No-overflow

bad:

  :
  IV struct:
    SSA_NAME:   _1
    Type:       integer(kind=8)
    Base:       (integer(kind=8)) l0_19(D) * 81
    Step:       81
    Biv:        N
    Overflowness wrto loop niter:       No-overflow
  IV struct:
    SSA_NAME:   _20
    Type:       integer(kind=8)
    Base:       (integer(kind=8)) l0_19(D)
    Step:       1
    Biv:        N
    Overflowness wrto loop niter:       No-overflow
  IV struct:
    SSA_NAME:   i7_28
    Type:       integer(kind=4)
    Base:       l0_19(D) + 1
    Step:       1
    Biv:        Y
    Overflowness wrto loop niter:       No-overflow
  IV struct:
    SSA_NAME:   vectp.22_46
    Type:       integer(kind=4) *
    Base:       (integer(kind=4) *) + ((sizetype) ((integer(kind=8)) l0_19(D) * 81) + 9) * 4
    Step:       324
    Object:     (void *)
    Biv:        N
    Overflowness wrto loop niter:       No-overflow
[Bug tree-optimization/114932] Improvement in CHREC can give large performance gains
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #3 from Tamar Christina ---
(In reply to Andrew Pinski from comment #2)
> > which is harder for prefetchers to follow.
>
> This seems like a limitation in the HW prefetcher rather than anything else.
> Maybe the cost model for addressing mode should punish base+index if so.
> Many HW prefetchers I know of are based on the final VA (or even PA) rather
> looking at the instruction to see if it increments or not ...

That was the first thing we tried, and even increasing the cost of
register_offset to something ridiculously high doesn't change a thing.
IVopts thinks it needs to use it and generates:

  _1150 = (voidD.26 *) _1148;
  _1152 = (sizetype) l0_78(D);
  _1154 = _1152 * 324;
  _1156 = _1154 + 216;
  # VUSE <.MEM_421>
  vect__349.614_1418 = MEM [(integer(kind=4)D.9 *)_1150 + _1156 * 1 clique 2 base 0];

Hence the bug report to see what's going on.
[Bug tree-optimization/114932] New: Improvement in CHREC can give large performance gains
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

Bug ID: 114932 Summary: Improvement in CHREC can give large performance gains Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: tnfchris at gcc dot gnu.org CC: rguenth at gcc dot gnu.org Target Milestone: ---

With the original fix from PR114074 applied (e.g.
g:a0b1798042d033fd2cc2c806afbb77875dd2909b) we not only saw regressions but
also big improvements.

The following testcase:

---
module brute_force
  integer, parameter :: r=9
  integer block(r, r, r)
contains
subroutine brute
  k = 1
  call digits_2(k)
end
recursive subroutine digits_2(row)
  integer, intent(in) :: row
  logical OK
  do i1 = 0, 1
  do i2 = 1, 1
  do i3 = 1, 1
  do i4 = 0, 1
  do i5 = 1, select
  do i6 = 0, 1
  do i7 = l0, u0
     select case(1 )
     case(1)
        block(:2, 7:, i7) = block(:2, 7:, i7) - 1
     end select
     do i8 = 1, 1
        do i9 = 1, 1
           if(row == 5) then
           elseif(OK)then
              call digits_2(row + 1)
           end if
        end do
     end do
     block(:, 1, i7) = select
  end do
  end do
  end do
  end do
  end do
  block = 1
  end do
  block = 1
  block = block0 + select
  end do
end
end
---

compiled with: -mcpu=neoverse-v1 -Ofast -fomit-frame-pointer foo.f90

gets vectorized after sra and constprop.  But the final addressing modes are
so complicated that IVopts generates a register offset mode:

  4c:   2f00041d        mvni    v29.2s, #0x0
  50:   fc666842        ldr     d2, [x2, x6]
  54:   fc656841        ldr     d1, [x2, x5]
  58:   fc646840        ldr     d0, [x2, x4]
  5c:   0ebd8442        add     v2.2s, v2.2s, v29.2s
  60:   0ebd8421        add     v1.2s, v1.2s, v29.2s
  64:   0ebd8400        add     v0.2s, v0.2s, v29.2s

which is harder for prefetchers to follow.

When the patch was applied it was able to correctly lower these to the
immediate offset loads that the scalar code was using:

  38:   2f00041d        mvni    v29.2s, #0x0
  34:   fc594002        ldur    d2, [x0, #-108]
  40:   fc5b8001        ldur    d1, [x0, #-72]
  44:   fc5dc000        ldur    d0, [x0, #-36]
  48:   0ebd8442        add     v2.2s, v2.2s, v29.2s
  4c:   0ebd8421        add     v1.2s, v1.2s, v29.2s
  50:   0ebd8400        add     v0.2s, v0.2s, v29.2s

and also removes all the additional instructions to keep x6, x5 and x4 up to
date.

This gave 10%+ improvements on various workloads.

(ps I'm looking at the __brute_force_MOD_digits_2.constprop.3.isra.0
specialization).

I will try to reduce it more, but am filing this so we can keep track and
hopefully fix.
[Bug ipa/92538] Proposal for IPA init() constant propagation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92538 Tamar Christina changed: What|Removed |Added CC||jamborm at gcc dot gnu.org --- Comment #7 from Tamar Christina --- Hi Martin, this is the IPA pass I was talking about at the cauldron. I see the patch doesn't have any standalone testcases. Do you have some Erick?
[Bug target/114860] [14/15 regression] [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 since r14-10014-ga2f4be3dae04fa
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114860 --- Comment #3 from Tamar Christina --- I cannot reproduce this even recompiling libc.
[Bug target/114860] [14/15 regression] [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 since r14-10014-ga2f4be3dae04fa
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114860 Tamar Christina changed: What|Removed |Added CC||tnfchris at gcc dot gnu.org --- Comment #2 from Tamar Christina --- There's no change in the binaries aside from an alignment change in one of the lesser hot functions. I guess you recompile libc? In which case you need to compare against GCC 13. As mentioned in the quoted commit this reverts the general case for GCC 14.
[Bug target/114860] [14/15 regression] [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 since r14-10014-ga2f4be3dae04fa
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114860

--- Comment #1 from Tamar Christina ---
Hmm, I am unable to reproduce this with -O3 -flto -mcpu=neoverse-v2 on a neoverse-v2 machine.  Is any other option required?  Also, that code was new in GCC 14 and was partially reverted due to a register allocation issue.  So if there is a performance difference it's a missed optimization and not a regression.
[Bug rtl-optimization/114766] ^ constraint modifier unexpectedly affects register class selection.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114766

--- Comment #2 from Tamar Christina ---
(In reply to Vladimir Makarov from comment #1)
> (In reply to Tamar Christina from comment #0)
> > The documentation for ^ states:
>
> If it works for you, we could try to use the patch (although it needs some
> investigation how other targets uses the hint).  In any case, the
> documentation should be modified or made more clear depending on applying or
> not applying the patch.

Yeah, using the patch gives us the behavior we expected, we added a workaround
for now so we can investigate what other targets do in GCC 15.

But while looking at this we also got some unexpected behavior with using ?

For instance we have the pattern:

;; Equal width integer to fp conversion.
(define_insn "2"
  [(set (match_operand:GPF 0 "register_operand")
        (FLOATUORS:GPF (match_operand: 1 "register_operand")))]
  "TARGET_FLOAT"
  {@ [ cons: =0 , 1  ; attrs: type     , arch ]
     [ w        , w  ; neon_int_to_fp_ , simd ] cvtf\t%0, %1
     [ w        , ?r ; f_cvti2f        , fp   ] cvtf\t%0, %1
  }
)

for modeling floating point conversions.  We had expected ? to make the
alternative more expensive, but still possible.  However again during IRA the
entire register class is blocked:

    r103: preferred FP_REGS, alternative NO_REGS, allocno FP_REGS

I would have expected that the additional penalty should never make an
alternative impossible.

We thought maybe the move costs were an issue, but we tried with various big
and small numbers but it looks like the move costs have little to no effect.

Since the original value in this case was in r:

    Choosing alt 0 in insn 7:  (0) =rk  (1) %rk  (2) I {*adddi3_aarch64}

I would have expected the cost for FP_REGS not to be 0, as there's a register
file move involved:

    r103 costs: W8_W11_REGS:2000 W12_W15_REGS:2000 TAILCALL_ADDR_REGS:2000
    STUB_REGS:2000 GENERAL_REGS:2000 FP_LO8_REGS:0 FP_LO_REGS:0 FP_REGS:0
    POINTER_AND_FP_REGS:7000 MEM:9000

In this particular pattern the ? isn't needed so we're removing it, but the
behavior is still unexpected.
[Bug tree-optimization/114769] [14 Regression] Suspicious code in vect_recog_sad_pattern() since r14-1832
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114769

Tamar Christina changed:
           What    |Removed   |Added
        Resolution |---       |FIXED
            Status |ASSIGNED  |RESOLVED

--- Comment #5 from Tamar Christina ---
(In reply to Feng Xue from comment #3)
> When half_type is null, and call chain would be:
>
>   vect_supportable_direct_optab_p (vinfo, sum_type, SAD_EXPR, NULL, ...)
>     -> get_vectype_for_scalar_type (vinfo, NULL)
>       -> get_related_vectype_for_scalar_type (vinfo, /*scalar_type= */NULL,
> ...)
>
>            if ((!INTEGRAL_TYPE_P (scalar_type)
>                 && !POINTER_TYPE_P (scalar_type)
>                 && !SCALAR_FLOAT_TYPE_P (scalar_type))
>
> Here will be a segfault.

Sure, I wasn't able to trigger that even disabling the optab.  In any case
this version doesn't get there anymore.

So fixed, thanks for the report!
[Bug tree-optimization/114769] [14 Regression] Suspicious code in vect_recog_sad_pattern() since r14-1832
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114769

--- Comment #2 from Tamar Christina ---
I believe this is safe, but the interface is definitely not the cleanest.

vect_recog_absolute_difference has two callers:

1. vect_recog_sad_pattern, where if you return true with unprom not set, then
   *half_type will be NULL.  The call to vect_supportable_direct_optab_p will
   always reject it since there's no vector mode for NULL.  Note that the
   convention in the dump files has always been that we first indicate that a
   pattern could possibly be recognized and then check that it's supported.
   This change somewhat incorrectly makes the diagnostic message get printed
   for "invalid" patterns.

2. vect_recog_abd_pattern, where if half_type is NULL, it then uses diff_stmt
   to set them.

So while the note in the dump file is misleading, the code is safe.  I will
however slightly refactor it to prevent the confusion.
[Bug target/113625] Interesting behavior with and without -mcpu=generic
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113625 Tamar Christina changed: What|Removed |Added CC||tnfchris at gcc dot gnu.org --- Comment #4 from Tamar Christina --- Generic is not the default for aarch64 and the default depends on the architecture revision. The documentation patch got stuck on bike shedding on the list. I'll resubmit it tomorrow. Generic was kept around for backwards compatibility reasons. That said, both codegen below are suboptimal...
[Bug rtl-optimization/114766] New: ^ constraint modifier unexpectedly affects register class selection.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114766

Bug ID: 114766 Summary: ^ constraint modifier unexpectedly affects register class selection. Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: tnfchris at gcc dot gnu.org CC: vmakarov at gcc dot gnu.org Target Milestone: ---

The documentation for ^ states:

"This constraint is analogous to ‘?’ but it disparages slightly the alternative
only if the operand with the ‘^’ needs a reload."

From this we gathered that there's only a slight cost when the given operand
requires a reload.

In PR114741 we had a regression when using this modifier because it seems to
unexpectedly also change register class selection during coloring.

The pattern was:

(define_insn "3"
  [(set (match_operand:GPI 0 "register_operand")
        (LOGICAL:GPI (match_operand:GPI 1 "register_operand")
                     (match_operand:GPI 2 "aarch64_logical_operand")))]
  ""
  {@ [ cons: =0 , 1  , 2 ; attrs: type , arch ]
     [ r        , %r , r ; logic_reg   , *    ] \t%0, %1, %2
     [ rk       , ^r ,   ; logic_imm   , *    ] \t%0, %1, %2
     [ w        , 0  ,   ; *           , sve  ] \t%Z0., %Z0., #%2
     [ w        , w  , w ; neon_logic  , simd ] \t%0., %1., %2.
  }
)

where we wanted to prefer the r->r alternative unless a reload is needed, in
which case we preferred the w->w one.

But in the simple example of:

void foo(unsigned v, unsigned *p)
{
    *p = v & 1;
}

where the operation can be done on both r->r and w->w, but w->w needs a mov,
it would not pick r->r.

This is because during sched1 the penalty applied to the ^ alternative made it
no longer consider GP regs:

    r106: preferred FP_REGS, alternative NO_REGS, allocno FP_REGS
    ;;      3--> b  0: i   9 r106=r105&0x1
        :cortex_a53_slot_any:GENERAL_REGS+0(-1)FP_REGS+1(1)PR_LO_REGS+0(0)
        PR_HI_REGS+0(0):model 4

The penalty here seems incorrect, and removing it seems to get the constraint
to work properly.

So the question is, is it a bug, or are we using it incorrectly?  Or is it a
documentation bug?
[Bug target/114741] [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114741 Tamar Christina changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #10 from Tamar Christina --- Fixed, thanks for the report.
[Bug target/114513] [11/12/13/14 Regression] [aarch64] floating-point registers are used when GPRs are preferred
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114513 Bug 114513 depends on bug 114741, which changed state. Bug 114741 Summary: [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114741 What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED
[Bug target/114741] [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114741 Tamar Christina changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |tnfchris at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #8 from Tamar Christina --- Indeed, Regtesting a patch that fixes it, so mine... It looks like ^ doesn't work well when there's a choice of multiple register files.
[Bug target/114741] [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114741 --- Comment #6 from Tamar Christina --- and the exact armv9-a cost model you quoted, also does the right codegen. https://godbolt.org/z/obafoT6cj There is just an inexplicable penalty being applied to the r->r alternative.
[Bug target/114741] [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114741

Tamar Christina changed:
           What    |Removed  |Added
                CC |         |tnfchris at gcc dot gnu.org,
                   |         |vmakarov at gcc dot gnu.org

--- Comment #5 from Tamar Christina ---
(In reply to Andrew Pinski from comment #4)
> (In reply to Wilco from comment #2)
> > It looks like the underlying bug is '^' being incorrectly treated like '?'
> > in record_reg_classes (which is never used during reload). Fixing that
> > results in the expected code being generated in all cases. It looks this
> > issue was introduced in the original commit
> > d1457701461d5a49ca6b5d8a6d1c83a37a6dc771
>
> static const struct cpu_regmove_cost generic_armv9_a_regmove_cost =
> {
>   1, /* GP2GP */
>   /* Spilling to int<->fp instead of memory is recommended so set
>      realistic costs compared to memmov_cost.  */
>   3, /* GP2FP */
>   2, /* FP2GP */
>   2  /* FP2FP */
> };
>
> Note these costs are broken.
> TARGET_REGISTER_MOVE_COST has this to say about the special value 2:
> ```
> If reload sees an insn consisting of a single set between two hard
> registers, and if TARGET_REGISTER_MOVE_COST applied to their classes returns
> a value of 2, reload does not check to ensure that the constraints of the
> insn are met. Setting a cost of other than 2 will allow reload to verify
> that the constraints are met. You should do this if the ‘movm’ pattern’s
> constraints do not allow such copying.
> ```
>
> The way I implemented this for thunderx was have GP2GP being cost of 2 and
> then be relative from there. That gave better code generation in general and
> even less spilling. I know I wrote about this before too.

I don't think this is related to that at all.

The old generic costs, which generic-armv8-a was taken from, don't use 2 and
are broken:
https://github.com/gcc-mirror/gcc/blob/master/gcc/config/aarch64/tuning_models/generic.h#L42

generic-armv8-a doesn't use 2 and is broken:
https://github.com/gcc-mirror/gcc/blob/master/gcc/config/aarch64/tuning_models/generic_armv8_a.h#L43

neoverse-n2 uses 2 and isn't broken:
https://github.com/gcc-mirror/gcc/blob/master/gcc/config/aarch64/tuning_models/neoversen2.h

I think the issue is what Wilco mentioned before.  The GCC docs claim that ^
shouldn't affect initial costing/IRA, but it clearly does.  It's penalizing
the r->r alternative during initial costing and not just during LRA if a
reload is needed.

So the question to Vlad is: is it correct that ^ is treated the same way as ?
during initial costing, i.e. that it penalizes the alternative?
[Bug tree-optimization/113552] [11/12/13 Regression] vectorizer generates calls to vector math routines with 1 simd lane.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113552 Tamar Christina changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #15 from Tamar Christina --- Fixed on trunk and all open branches
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403 Tamar Christina changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #29 from Tamar Christina --- Fixed, thanks for the report!
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403

--- Comment #26 from Tamar Christina ---
(In reply to Richard Biener from comment #25)
> That means, when the loop takes the early exit we _must_ take that during
> the vector iterations.  Peeling for gaps means if we would take the early
> exit during one of the gap peeled iterations this is a conflicting
> requirement.
> Now - the current analysis guarantees that the early exit conditions can
> be safely evaluated even for the gap iterations, but not the following
> code when the early exit is _not_ taken.
>
> So peeling for gaps and early exit vect are not compatible?

I don't see why not; as my email explains, for the early exits we always go to
the scalar loop, which already adheres to the condition of peeling for gaps.
I just think that peeling for gaps should not force it to exit from the main
exit.
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403

--- Comment #24 from Tamar Christina ---
(In reply to Richard Biener from comment #23)
> Maybe easier to understand testcase:
>
> with -O3 -msse4.1 -fno-vect-cost-model we return 20 instead of 8.  Adding
> -fdisable-tree-cunroll avoids the issue.  The upper bound we set on the
> vector loop causes us to force taking the IV exit which continues
> with i == (niter - 1) / VF * VF, but 'niter' is 20 here.

Yes, indeed, that's what my patch was arguing last time, but I didn't explain
it well enough.  I'm about to send out v2 (waiting for regtest to finish)
which hopefully articulates this better.
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403

--- Comment #22 from Tamar Christina ---
Note that due to the secondary exit the actual full-vector iteration count is 8 scalar elements, which at VF=4 is 2 vector iterations.  And it's this boundary condition where we fail, since ceil (8/4) == 2; any other value would have done the partial vector iteration.  Basically final_iter_may_be_partial ends up being ignored.
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403

--- Comment #21 from Tamar Christina ---
Created attachment 57932
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57932&action=edit
loop.c

attached reduced testcase that reproduces the issue and also checks the buffer
position and copied values.

As discussed on IRC, when peeling for gaps we need to either adjust the upper
bounds of the vector loop or force the vector loop to go to the scalar loop.
However we already go to the scalar loop, just with the wrong induction value,
because we were never supposed to take the main exit.

Whether we go to the scalar loop depends on

  x = (((ptr2 - ptr1) - 16) / 16) + 1
  x == (((x - 1) >> 2) << 2)

in this case x == 26, so we do go to the scalar code already, but through the
main exit.

Exiting through the main exit assumes you've done all vector iterations, in
this case 6 iterations based on the main exit condition, which is
first != last.  In this case the induction values will be set based on
niters_vector_mult, so in this case first += 24.

But that's wrong since the secondary exit has a known iteration count of 9,
due to (buffer_ptr + store_size) <= buffer_end.

  Statement (exit)if (ivtmp_21 != 0)
   is executed at most 8 (bounded by 8) + 1 times in loop 1.

So we will always exit through it as 9 < 24.  That means that when we
calculate the upper bounds of the vector loop, we must add a bias so that in
this boundary condition we do an extra partial vector iteration.

I think the discussion on IRC went off track for a bit and hopefully this
testcase and the explanation above show that for all early break and all
epilogue peeling reasons, we must bias up the upper bound to give the
secondary exits a chance to trigger.

So I really do think the correct patch is:

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 4375ebdcb49..0973b952c70 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -12144,6 +12144,9 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
      -min_epilogue_iters to remove iterations that cannot be performed by the
      vector code.  */
   int bias_for_lowest = 1 - min_epilogue_iters;
+  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
+    bias_for_lowest = 1;
+
   int bias_for_assumed = bias_for_lowest;
   int alignment_npeels = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
   if (alignment_npeels && LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))

for the reasons described above.  There's no way for us to take the main exit,
which signifies we've reached the end of all iterations we can possibly do as
vector, and get the correct induction values in this case.
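To make the boundary condition concrete, here is the arithmetic from the
comment above as a standalone check (assuming VF == 4, as in the reduced
testcase; this is my own illustration, not part of the attachment):

  #include <assert.h>

  int main (void)
  {
    unsigned long x = 26;                     /* niters seen by the main IV exit */
    unsigned long full = ((x - 1) >> 2) << 2; /* full vector iterations * VF     */
    assert (full == 24);
    assert (x != full);  /* so the scalar epilogue runs, but only after the main
                            exit has already advanced first by 24, well past the
                            secondary exit's bound of 9.  */
    return 0;
  }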
[Bug tree-optimization/114635] OpenMP reductions fail dependency analysis
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114635

--- Comment #6 from Tamar Christina ---
(In reply to Jakub Jelinek from comment #4)
> Now, with SVE/RISCV vectors the actual vectorization factor is a poly_int
> rather than constant.  One possibility would be to use VLA arrays in those
> cases, but then it will be hard to undo that later, or allow both shrinking
> and growing the arrays and even turning them into VLA-like ones.

I think they are already VLAs of some kind going into vect, unless I've
misunderstood the declaration:

  float D.28295[0:POLY_INT_CST [15, 16]];
  float D.28293[0:POLY_INT_CST [15, 16]];
  float D.28290[0:POLY_INT_CST [15, 16]];

it looks like during vectorization they are lowered though.
[Bug tree-optimization/114635] New: OpenMP reductions fail dependency analysis
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114635

Bug ID: 114635 Summary: OpenMP reductions fail dependency analysis Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: tnfchris at gcc dot gnu.org Target Milestone: ---

The following testcase reduced from an HPC workload:

#include <math.h>

#define RESTRICT restrict

void work(int n, float *RESTRICT x, float *RESTRICT y, float *RESTRICT z,
          float *RESTRICT mass, float x0, float y0, float z0,
          float *RESTRICT ax, float *RESTRICT ay, float *RESTRICT az) {
  float lax = 0.0f, lay = 0.0f, laz = 0.0f;

#if _OPENMP >= 201307
#pragma omp simd reduction(+:lax,lay,laz)
#endif
  for (int i = 0; i < n; ++i) {
    float dx = x[i] - x0;
    float dy = y[i] - y0;
    float dz = z[i] - z0;
    float r2 = dx + dy + dz;
    if (r2 == 0.0f)
      continue;

    float f = (1.0f / (r2 * sqrtf(r2))) * mass[i];
    lax += f * dx;
    lay += f * dy;
    laz += f * dz;
  }

  *ax += lax;
  *ay += lay;
  *az += laz;
}

when compiled with -Ofast -march=armv9-a -fopenmp-simd vectorizes as expected,
but when the pragma is in effect, e.g. -Ofast -march=armv9-a -fopenmp, then
the main loop fails to vectorize with:

  (compute_affine_dependence
    ref_a: D.5962[_33], stmt_a: _69 = D.5962[_33];
    ref_b: D.5962[_33], stmt_b: D.5962[_33] = _ifc__147;
  ) -> dependence analysis failed
  /app/example.c:16:17: missed:  bad data dependence.
  /app/example.c:16:17: note:  * Analysis failed with vector mode VNx4SF

This doesn't seem to happen with just 2 reductions, but with 3 dependency
analysis seems to fail.

I don't know much about OpenMP, but my understanding is that this pragma is
intended for architectures that don't have masking support and works by
splitting the loop and removing the reductions from the main loop, creating
OpenMP "workers" which each work on one thread.  The reduction values are
turned into local arrays and these threads then write into their own slots in
these arrays.  The reduction itself is then done as a final post step.

It looks like the only thing we can vectorize is the post step.

I wonder, since the compiler is the one introducing these local arrays, can we
not mark them safe from inter-dependencies?
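My rough mental model of what the -fopenmp lowering does to the reductions, as
a hand-written sketch (the lane count and all names are made up; this is not
the actual omp-lowering output):

  #define LANES 4   /* stand-in for the simd width chosen by the compiler */

  float reduce_sketch (int n, const float *x)
  {
    float lax_priv[LANES] = { 0.0f };   /* compiler-introduced local array */

    for (int i = 0; i < n; ++i)
      lax_priv[i % LANES] += x[i];      /* each lane writes only its own slot */

    float lax = 0.0f;
    for (int l = 0; l < LANES; ++l)     /* the final post-step reduction */
      lax += lax_priv[l];
    return lax;
  }

In this shape the read and write of lax_priv at the same index inside the main
loop is what dependence analysis trips over, even though each lane only ever
touches its own slot.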
[Bug target/114577] New: Inefficient codegen for SVE/NEON bridge
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114577

Bug ID: 114577 Summary: Inefficient codegen for SVE/NEON bridge Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: tnfchris at gcc dot gnu.org Target Milestone: --- Target: aarch64*

The following sequence:

#include <arm_neon_sve_bridge.h>

svint32_t f (int *a, int *b)
{
    int32x4_t va = vld1q_s32 (a);
    svint32_t za = svset_neonq_s32 (svundef_s32 (), va);
    return za;
}

compiled with -O2 -march=armv9-a is expected to be a simple load but generates:

f:
        ldr     q31, [x0]
        ptrue   p3.s, vl4
        sel     z0.s, p3, z31.s, z0.s
        ret

instead of the expected (from clang):

f:                                      // @f
        ldr     q0, [x0]
        ret

it looks like GCC's implementation of svset_neonq_s32 with svundef does not
become a view_convert/subreg.
[Bug target/114510] [14 Regression] missed proping of multiply by 2 into address of load/stores
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114510 Tamar Christina changed: What|Removed |Added CC||tnfchris at gcc dot gnu.org --- Comment #3 from Tamar Christina --- Richard's patch should fix this next year. https://gcc.gnu.org/pipermail/gcc-patches/2023-October/634166.html We'll then stop relying on combine or other passes to fix this.
[Bug rtl-optimization/114515] [14 Regression] Failure to use aarch64 lane forms after PR101523
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114515 Tamar Christina changed: What|Removed |Added CC||tnfchris at gcc dot gnu.org See Also||https://gcc.gnu.org/bugzill ||a/show_bug.cgi?id=114575 --- Comment #10 from Tamar Christina --- This has also broken our addressing modes https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114575
[Bug rtl-optimization/114575] New: [14 Regression] SVE addressing modes broken since g:839bc42772ba7af66af3bd16efed4a69511312ae
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114575

Bug ID: 114575 Summary: [14 Regression] SVE addressing modes broken since g:839bc42772ba7af66af3bd16efed4a69511312ae Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: tnfchris at gcc dot gnu.org Target Milestone: --- Target: aarch64*

Created attachment 57864
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57864&action=edit
addr.cc

Since the offending commit, compiling the example attached with
-O3 -march=armv8.5-a+sve2 results in:

.L14:
        lsl     x0, x1, 1
        add     x1, x1, 8
        add     x2, x3, x0
        add     x6, x0, x4
        ld1h    z2.h, p7/z, [x2]
        ld1h    z22.h, p7/z, [x6]
        add     x0, x0, x5
        fmad    z22.h, p7/m, z21.h, z2.h
        ld1h    z20.h, p7/z, [x0]
        fmad    z20.h, p7/m, z26.h, z22.h
        st1h    z20.h, p7, [x2]
        cmp     x1, 808
        bne     .L14

instead of what it was before the commit:

.L14:
        ld1h    z2.h, p7/z, [x1, x0, lsl 1]
        ld1h    z22.h, p7/z, [x2, x0, lsl 1]
        ld1h    z20.h, p7/z, [x3, x0, lsl 1]
        fmad    z22.h, p7/m, z21.h, z2.h
        fmad    z20.h, p7/m, z26.h, z22.h
        st1h    z20.h, p7, [x1, x0, lsl 1]
        add     x0, x0, 8
        cmp     x0, 808
        bne     .L14

It's now no longer pushing in the register shifts.  This causes significant
performance loss as it needs to now perform the integer ALU ops before doing
the load, and they're on the critical path.
[Bug rtl-optimization/113682] Branches in branchless binary search rather than cmov/csel/csinc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113682 --- Comment #9 from Tamar Christina --- (In reply to Andrew Pinski from comment #8) > This might be the path splitting running on the gimple level causing issues > too; see PR 112402 . Ah that's a good shout. It looks like Richi already agrees that we should recognize/do some ifcvt at GIMPLE. Guess that just leaves the where.
[Bug rtl-optimization/113682] Branches in branchless binary search rather than cmov/csel/csinc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113682

Tamar Christina changed:
           What    |Removed     |Added
            Status |UNCONFIRMED |NEW
    Ever confirmed |0           |1
         Component |middle-end  |rtl-optimization
  Last reconfirmed |            |2024-04-02

--- Comment #7 from Tamar Christina ---
Ok, so for now I think we should separate out the issue of whether/when we
should perform the if-conversion and instead for a bit look at why we miss
these conversions:

The first problem is that GCC does not support if-converting blocks which
terminate into another branch.  In other words, the destination of one of the
arms must have a single successor per noce_find_if_block, which requires the
final block to be a JOIN block.

In

size_t bsearch2(ITEM* user, ITEM** tree, size_t treeSize) {
    auto pos = tree;
    size_t low = 0;
    size_t high = treeSize;
    while (low < high) {
        size_t i = (low + high) / 2;
        int res = cmp(user, tree[i]);
        // These should be cmp + csinc + csel on arm
        // and lea + test + cmov + cmov on x86.
        low = res > 0 ? i + 1 : low;  // csinc
        high = res < 0 ? i : high;    // csel
        if (res == 0) return i;
    }
    return -1;
}

this fails because of the:

        if (res == 0) return i;

which means that the final block doesn't adhere to this.  So the cbranch
blocks the if-conversion.

If we simplify the testcase:

typedef unsigned long size_t;
extern int cmp(size_t *);

size_t bsearch2(size_t high, size_t low, size_t i) {
    while (low < high) {
        int res = cmp();
        low = res > 0 ? i + 1 : low;
        high = res < 0 ? i : high;
    }
    return -1;
}

we now detect the csel but not the csinc.  This has two reasons: we visit the
BBs in entry->exit order, but we probably should do exit->entry, such that
when we if-convert blocks and delete the branches, the preceding blocks become
valid to if-convert.

But the main reason the csinc is missed is because of code sinking.  The
update of the reference is sunk into the res > 0 branch.

   [local count: 955630224]:
  # high_18 = PHI
  # low_19 = PHI
  res_11 = cmp ();
  if (res_11 > 0)
    goto ; [59.00%]
  else
    goto ; [41.00%]

   [local count: 563821836]:
  i.1_1 = i;
  iftmp.0_12 = i.1_1 + 1;
  goto ; [100.00%]

This is because the compiler (rightfully) decides that the dereference should
only happen in the branch that needs it.  Because of this, block 4 is no
longer if-convertible.  However, since the load is to the stack we are allowed
to move it around, technically speaking.  But this is only beneficial if the
branch is really unpredictable.

To show that it is the sinking, we can avoid it by using -fdisable-tree-sink2
and this example:

size_t bsearch2(size_t high, size_t low, size_t *iv, int *resv) {
    while (low < high) {
        int res = cmp (iv);
        size_t i = *iv;
        low = res > 0 ? i + 1 : low;
    }
    return -1;
}

which generates the csinc.  However the following example again doesn't
generate the csinc for GCC but does for LLVM:

size_t bsearch3(size_t high, size_t low, size_t i, int res) {
    while (low < high) {
        asm volatile ("" : "=w"(i));
        asm volatile ("" : "=w"(res));
        low = res > 0 ? i + 1 : low;
    }
    return -1;
}

which is because of the change header pass.

GCC optimizes this scalar code from:

size_t bsearch3 (size_t high, size_t low, size_t i, int res)
{
  size_t iftmp.0_7;

   [local count: 59055799]:
  goto ; [100.00%]

   [local count: 1014686024]:
  __asm__ __volatile__("" : "=w" i_5);
  __asm__ __volatile__("" : "=w" res_6);
  if (res_6 > 0)
    goto ; [5.50%]
  else
    goto ; [94.50%]

   [local count: 55807731]:
  iftmp.0_7 = i_5 + 1;

   [local count: 114863530]:
  # low_1 = PHI

   [local count: 1073741824]:
  if (low_1 < high_3(D))
    goto ; [94.50%]
  else
    goto ; [5.50%]

   [local count: 59055800]:
  return 18446744073709551615;
}

into

size_t bsearch3 (size_t high, size_t low, size_t i, int res)
{
  size_t iftmp.0_7;

   [local count: 59055799]:
  if (low_2(D) < high_3(D))
    goto ; [48.59%]
  else
    goto ; [51.41%]

   [local count: 958878294]:
  __asm__ __volatile__("" : "=w" i_5);
  __asm__ __volatile__("" : "=w" res_6);
  if (res_6 > 0)
    goto ; [5.50%]
  else
    goto ; [94.50%]

   [local count: 55807731]:
  # i_10 = PHI
  iftmp.0_7 = i_10 + 1;
  if (high_3(D) > iftmp.0_7)
    goto ; [48.59%]
  else
    goto ; [51.41%]

   [local count: 55807730]:
  __asm__ __volatile__("" : "=w" i_8);
  __asm__ __volatile__("" : "=w" res_9);
  if (res_9 > 0)
    goto ; [5.50%]
  else
    goto ; [94.50%]

   [local count: 59055800]:
  return 18446744073709551615;
}

which then hits the limitation of the single pred I mentioned before.

Using -fdisable-tree-ch2 we get:

        add     x2, x2, 1
        csel    x1,
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403

--- Comment #20 from Tamar Christina ---
This is a bad interaction between early break and peeling for gaps.  When
peeling for gaps we set bias_for_lowest to 0, which then negates the ceil in
the upper bound calculation when the division is exact.

We end up, on a loop that does:

  Analyzing # of iterations of loop 1
    exit condition [8, + , 18446744073709551615] != 0
    bounds on difference of bases: -8 ... -8
    result:
      # of iterations 8, bounded by 8

and a VF=4, calculating:

  Loop 1 iterates at most 1 times.
  Loop 1 likely iterates at most 1 times.
  Analyzing # of iterations of loop 1
    exit condition [1, + , 1](no_overflow) < bnd.5505_39
    bounds on difference of bases: 0 ... 4611686018427387902
  Matching expression match.pd:2011, generic-match-8.cc:27
  Applying pattern match.pd:2067, generic-match-1.cc:4813
    result:
      # of iterations bnd.5505_39 + 18446744073709551615, bounded by 4611686018427387902
  Estimating sizes for loop 1
  ...
    Induction variable computation will be folded away.
    size:   2 if (ivtmp_312 < bnd.5505_39)
     Exit condition will be eliminated in last copy.
  size: 24-3, last_iteration: 24-5
    Loop size: 24
    Estimated size after unrolling: 26
  ;; Guessed iterations of loop 1 is 0.858446. New upper bound 1.

The upper bound should be 2, not 1.

I have a working patch, trying to create a standalone testcase for it.
[Bug tree-optimization/114403] [14 regression] LLVM miscompiled with -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403 Tamar Christina changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Last reconfirmed||2024-04-02 Assignee|unassigned at gcc dot gnu.org |tnfchris at gcc dot gnu.org Ever confirmed|0 |1 --- Comment #19 from Tamar Christina --- Thanks! back from holidays and looking into it now. mine.
[Bug tree-optimization/114345] FRE missing knowledge of semantics of IFN loads
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114345

--- Comment #5 from Tamar Christina ---
(In reply to Richard Biener from comment #4)
> Well, the shuffling in .LOAD_LANES will be a bit awkward to do, but sure.  We
> basically lack "constant folding" of .LOAD_LANES and similarly of course
> we can't see through .STORE_LANES of a constant when later folding a scalar
> load from the same memory.

I guess it becomes harder with the 3 and 4 lane ones, but the 2-lane one is
just a single VEC_PERM_EXPR, no?
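As an illustration of that last point, de-interleaving the two lanes is one
even/odd permute per output; a minimal sketch using the generic vector
extension (not GCC-internal code, names made up):

  typedef long v2di __attribute__((vector_size (16)));

  /* A 2-lane .LOAD_LANES of {a0, b0, a1, b1} is just two contiguous loads
     followed by one even and one odd permute.  */
  static inline void load_lanes_2 (const long *p, v2di *lane0, v2di *lane1)
  {
    v2di lo = { p[0], p[1] };
    v2di hi = { p[2], p[3] };
    *lane0 = __builtin_shufflevector (lo, hi, 0, 2);   /* even elements */
    *lane1 = __builtin_shufflevector (lo, hi, 1, 3);   /* odd elements  */
  }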
[Bug target/114350] New: missing support for SVE widening floating point conversion
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114350

Bug ID: 114350 Summary: missing support for SVE widening floating point conversion Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: tnfchris at gcc dot gnu.org Target Milestone: --- Target: aarch64*

The following example:

#include <arm_sve.h>

svfloat64_t widening (svint32_t a)
{
    svbool_t pred = svptrue_b32 ();
    svint64_t cvt = svreinterpret_s64_s32 (a);
    svint64_t ext = svextw_s64_x (pred, cvt);
    svfloat64_t res = svcvt_f64_s64_x (pred, ext);
    return res;
}

compiled with -Ofast -march=armv9-a generates:

widening(__SVInt32_t):
        ptrue   p3.b, all
        sxtw    z0.d, p3/m, z0.d
        scvtf   z0.d, p3/m, z0.d
        ret

but SVE has widening and narrowing floating point conversions, as such this
should generate:

widening(__SVInt32_t):
        ptrue   p3.b, all
        scvtf   z0.d, p3/m, z0.s
        ret

The autovec equivalent is:

void f(int n, double *data) { for (int i=0;i
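The autovec example above is cut off; a loop of roughly that shape which wants
a widening int-to-double convert would be something like the following (my
reconstruction for illustration, not necessarily the original code):

  void f (int n, double *data)
  {
    for (int i = 0; i < n; i++)
      data[i] = data[i] + i;   /* 'i' has to be widened from int to double */
  }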
[Bug tree-optimization/114345] FRE missing knowledge of semantics of IFN loads
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114345 --- Comment #3 from Tamar Christina --- (In reply to Andrew Pinski from comment #2) > Oh VN does have some knowledge of MASK_STORE and LEN_STORE. Just not > LOAD_LANES . > > > See PR 106365 for MASK_STORE and LEN_STORE implementation. Shouldn't be hard > to add LOAD_LANES/STORE_LANES there ... Ah!, thanks for the pointer.
[Bug tree-optimization/114346] New: vectorizer generates the same IV twice
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114346

Bug ID: 114346 Summary: vectorizer generates the same IV twice Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: tnfchris at gcc dot gnu.org Target Milestone: ---

The following example:

---
double f(int n, double *data, double b) {
    double res = b;
    for (int i=0;i

  _48 = vect_vec_iv_.7_45 + { POLY_INT_CST [2, 2], ... };
  _71 = VIEW_CONVERT_EXPR(vect_vec_iv_.7_45);
  _72 = VIEW_CONVERT_EXPR({ POLY_INT_CST [4, 4], ... });
  _73 = _71 + _72;
  _49 = VIEW_CONVERT_EXPR(_73);

so it looks like _48 and _49 are the same value, except that _48 is done as a
32-bit IV and _49 is calculated as a 64-bit one and truncated to 32?
[Bug tree-optimization/114345] New: FRE missing knowledge of semantics of IFN loads
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114345

Bug ID: 114345 Summary: FRE missing knowledge of semantics of IFN loads Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: tnfchris at gcc dot gnu.org Target Milestone: ---

The following testcase:

---
long tdiff = 10412095;
int main() {
  struct {
    long maximum;
    int nonprimary_delay;
  } delays[] = {{}, {}, {}, {9223372036854775807, 36 * 60 * 60}};
  for (unsigned i = 0; i < sizeof(delays) / sizeof(delays[0]); ++i)
    if (tdiff <= delays[i].maximum)
      return delays[i].nonprimary_delay;
  __builtin_abort();
}
---

compiled with -O2 -fno-vect-cost-model generates on AArch64:

  vect_cst__45 = {tdiff.0_2, tdiff.0_2};
  vect_array.11 = .LOAD_LANES (MEM [(long int *)]);
  vect__1.12_40 = vect_array.11[0];
  vect_array.11 ={v} {CLOBBER};
  vect_array.14 = .LOAD_LANES (MEM [(long int *) + 32B]);
  vect__1.15_43 = vect_array.14[0];
  vect_array.14 ={v} {CLOBBER};
  mask_patt_15.17_46 = vect__1.12_40 >= vect_cst__45;
  mask_patt_15.17_47 = vect__1.15_43 >= vect_cst__45;
  vexit_reduc_51 = mask_patt_15.17_46 | mask_patt_15.17_47;

and on x86_64:

  vect_cst__53 = {tdiff.0_2, tdiff.0_2};
  _37 = { 0, 4294967295, 4294967294, 4294967293 };
  _32 = { 4, 5, 6, 7 };
  vect__1.11_42 = MEM [(long int *)];
  vectp_delays.9_43 = + 16;
  vect__1.12_44 = MEM [(long int *)vectp_delays.9_43];
  vect_perm_even_45 = VEC_PERM_EXPR ;
  vectp_delays.9_47 = + 32;
  vect__1.13_48 = MEM [(long int *)vectp_delays.9_47];
  vectp_delays.9_49 = + 48;
  vect__1.14_50 = MEM [(long int *)vectp_delays.9_49];
  vect_perm_even_51 = VEC_PERM_EXPR ;
  mask_patt_17.15_54 = vect_perm_even_45 >= vect_cst__53;
  mask_patt_17.15_55 = vect_perm_even_51 >= vect_cst__53;
  vexit_reduc_59 = mask_patt_17.15_54 | mask_patt_17.15_55;

which is eventually simplified by FRE into:

  vect_cst__53 = {tdiff.0_2, tdiff.0_2};
  mask_patt_17.15_54 = vect_cst__53 <= { 0, 0 };
  mask_patt_17.15_55 = vect_cst__53 <= { 0, 9223372036854775807 };
  vexit_reduc_59 = mask_patt_17.15_54 | mask_patt_17.15_55;

and realizing that the loads aren't needed.

It looks like the reason is that FRE doesn't understand LOAD_LANES and
MASKED_LOAD_LANES or the other load IFNs.  We thus end up with a spill to the
stack and a load of the constants.
[Bug tree-optimization/114339] [14 regression] Tor miscompiled with -O2 -mavx -fno-vect-cost-model since r14-6822
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114339

--- Comment #6 from Tamar Christina ---
vectorizer generates:

  mask_patt_21.19_58 = vect_perm_even_49 >= vect_cst__57;
  mask_patt_21.19_59 = vect_perm_even_55 >= vect_cst__57;
  vexit_reduc_63 = mask_patt_21.19_58 | mask_patt_21.19_59;
  if (vexit_reduc_63 != { 0, 0 })
    goto ; [20.00%]
  else
    goto ; [80.00%]

This is changed at loopdone into:

  delays[3].nonprimary_delay = 129600;
  vect_cst__57 = {tdiff_6, tdiff_6};
  mask_patt_21.19_58 = vect_cst__57 <= { 0, 0 };
  mask_patt_21.19_59 = vect_cst__57 <= { 0, 0x7FFF };
  vexit_reduc_63 = mask_patt_21.19_58 | mask_patt_21.19_59;
  if (vexit_reduc_63 != { 0, 0 })
    goto ; [20.00%]
  else
    goto ; [80.00%]

or in other words, if there's any value where the compare succeeds, find it
and return.  This looks correct to me.

It could be that my AVX is rusty, but this generates:

  vmovdqa 0xf9c(%rip),%xmm1        # 0x402010
  mov     $0x1,%eax
  vmovq   %rcx,%xmm3
  vmovdqa %xmm0,(%rsp)
  vpunpcklqdq %xmm3,%xmm3,%xmm2
  vmovdqa %xmm0,0x10(%rsp)
  vpcmpgtq %xmm2,%xmm1,%xmm1       #
  vmovdqa %xmm0,0x20(%rsp)
  vmovq   %rax,%xmm0
  vpunpcklqdq %xmm0,%xmm0,%xmm0
  movl    $0x1fa40,0x38(%rsp)
  vpcmpgtq %xmm2,%xmm0,%xmm0
  vpor    %xmm1,%xmm0,%xmm0
  vptest  %xmm0,%xmm0

which looks off; particularly for the second compare it looks like it doesn't
do a load but instead just duplicates the constant 1.  gdb seems to confirm
this.

At the first compare:

  (gdb) p $xmm2.v2_int64
  $4 = {10412095, 10412095}
  (gdb) p $xmm0.v2_int64
  $5 = {0, 0}

which is what's expected, but at the second compare:

  (gdb) p $xmm2.v2_int64
  $7 = {10412095, 10412095}
  (gdb) p $xmm0.v2_int64
  $6 = {1, 1}

at the second it's comparing {1, 1} instead of {0, 0x7FFF}.

On AArch64, where it doesn't fail, the comparison is:

  movi    v29.4s, 0
  add     x1, sp, 16
  ldr     x5, [x0, 8]
  mov     w0, 64064
  movk    w0, 0x1, lsl 16
  add     x3, sp, 48
  str     q29, [sp, 64]
  mov     x2, 57407
  mov     x4, 9223372036854775807
  str     x4, [sp, 64]
  movk    x2, 0x9e, lsl 16
  str     w0, [sp, 72]
  sub     x2, x2, x5
  stp     q29, q29, [x1]
  dup     v27.2d, x2
  ld2     {v30.2d - v31.2d}, [x1]
  str     q29, [sp, 48]
  ld2     {v28.2d - v29.2d}, [x3]
  cmge    v30.2d, v30.2d, v27.2d
  cmge    v28.2d, v28.2d, v27.2d
  orr     v30.16b, v30.16b, v28.16b
  umaxp   v30.4s, v30.4s, v30.4s
  fmov    x0, d30
  cbnz    x0, .L12

which has v30.2d being {0, 0} and v28.2d being {0, 0x7FFF} as expected...

On AArch64 we don't inline the constants because whatever is propagating the
constants can't understand the LOAD_LANES:

  mask_patt_19.21_50 = vect__2.16_44 >= vect_cst__49;
  mask_patt_19.21_51 = vect__2.19_47 >= vect_cst__49;
  vexit_reduc_55 = mask_patt_19.21_50 | mask_patt_19.21_51;
  if (vexit_reduc_55 != { 0, 0 })
    goto ; [20.00%]
  else
    goto ; [80.00%]

so could this be another expansion bug?

Note that a simpler reproducer is this:

---
long tdiff = 10412095;
int main() {
  struct {
    long maximum;
    int nonprimary_delay;
  } delays[] = {{}, {}, {}, {9223372036854775807, 36 * 60 * 60}};
  for (unsigned i = 0; i < sizeof(delays) / sizeof(delays[0]); ++i)
    if (tdiff <= delays[i].maximum)
      return delays[i].nonprimary_delay;
  __builtin_abort();
}
---

the key point is that we're not allowed to constprop tdiff at GIMPLE.  If we
do, e.g:

int main() {
  struct {
    long maximum;
    int nonprimary_delay;
  } delays[] = {{}, {}, {}, {9223372036854775807, 36 * 60 * 60}};
  long tdiff = 10412095;
  for (unsigned i = 0; i < sizeof(delays) / sizeof(delays[0]); ++i)
    if (tdiff <= delays[i].maximum)
      return delays[i].nonprimary_delay;
  __builtin_abort();
}

then after vectorization the const prop evaluates the entire expression at
GIMPLE and it gets the right result.  This makes me believe it's a target
expansion bug.
[Bug tree-optimization/114151] [14 Regression] weird and inefficient codegen and addressing modes since r14-9193
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114151

--- Comment #17 from Tamar Christina ---
> So doing in the vectorizer sth like the following should get us the best
> possible ranges?  Ah, probably only global ranges since the SCEV query
> itself would still lack context sensitive info (but as said we don't have
> a good context we can easily use).

which reminds me.. I have been playing around with using ranger directly in
the vectorizer.  It looks like we get better ranges for the overwidening
detection.

It feels like using ranger.range_of_expr or similar would solve the problem
that we don't currently get ranges for structural widening, i.e.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102188
[Bug tree-optimization/114234] [14 Regression] verify_ssa failure with early-break vectorisation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114234

Tamar Christina changed:
           What    |Removed     |Added
  Last reconfirmed |            |2024-03-05
            Status |UNCONFIRMED |NEW
    Ever confirmed |0           |1

--- Comment #3 from Tamar Christina ---
started with

commit g:324d2907c86f05e40dc52d226940308f53a956c2
Author: Richard Biener
Date:   Mon Mar 4 09:46:13 2024 +0100

    tree-optimization/114192 - scalar reduction kept live with early break vect

    The following fixes a missing replacement of the reduction value used in
    the epilog, causing the scalar reduction to be kept live across the early
    break exit path.

            PR tree-optimization/114192
            * tree-vect-loop.cc (vect_create_epilog_for_reduction): Use the
            appropriate def for the live out stmt in case of an alternate exit.

the reduction seems to propagate out a single value for both exits:

  # a_39 = PHI

but since the PHI nodes carry their reduction values in the exits we have
different reduction statements, so using the same reduction value for all
exits is wrong.  It should be

  # a_39 = PHI
[Bug tree-optimization/114192] scalar code left around following early break vectorization of reduction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114192

Tamar Christina changed:
           What    |Removed     |Added
    Ever confirmed |0           |1
            Status |UNCONFIRMED |NEW
  Last reconfirmed |            |2024-03-01

--- Comment #1 from Tamar Christina ---
Confirmed.

It looks like DCE6 no longer thinks:

  # sum_10 = PHI
  _1 = aD.4432[i_12];
  sum_7 = _1 + sum_11;

is dead after vectorization.  It removes the only dead consumer of sum_7, a
PHI node left over in the guard block which becomes unused after the reduction
is vectorized.

DCE says:

  marking necessary through sum_11 stmt sum_11 = PHI
  processing: sum_11 = PHI
  marking necessary through sum_7 stmt sum_7 = _1 + sum_11;
  processing: sum_7 = _1 + sum_11;
  marking necessary through _1 stmt _1 = a[i_12];
  processing: _1 = a[i_12];

so it thinks the closed definition is needed?

This seems to only happen with reductions, other live operations look fine:

extern int a[1024];

int f4(int *x, int n)
{
  int sum = 0;
  for (int i = 0; i < n; i++)
    {
      sum = a[i];
      if (a[i] == 42)
        break;
    }
  return sum;
}
[Bug target/98877] [AArch64] Inefficient code generated for tbl NEON intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98877 --- Comment #12 from Tamar Christina --- And it's not the first time we have conditional lowering. We already do so for e.g. shifts, where shifting by an amount >= the bitsize of a vector element is defined behavior on AArch64.
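As an illustration of what "conditional" means here (this example is mine, not from the PR):

---
#include <arm_neon.h>

/* Hedged example: SSHL shifts right for negative amounts and is well
   defined for amounts >= the element size, so this intrinsic can only be
   lowered to a plain GIMPLE x << amt when every lane of amt is known to be
   in [0, 31]; otherwise the builtin has to be kept as-is.  */
int32x4_t
shift_lanes (int32x4_t x, int32x4_t amt)
{
  return vshlq_s32 (x, amt);
}
---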
[Bug target/98877] [AArch64] Inefficient code generated for tbl NEON intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98877 --- Comment #11 from Tamar Christina --- (In reply to Andrew Pinski from comment #10)
> (In reply to Tamar Christina from comment #9)
> > While RA should be able to deal with this,
> > shouldn't we also just lower TBLs in gimple?
> >
> > There's no reason why this can't be a VEC_PERM_EXPR which would also get
> > the copies removed at the gimple level and allow us to optimize this to
> > something else depending on the index.
>
> Yes there is a reason, `out of range` values for VEC_PERM is undefined while
> tbl is well defined (If an index is out of range for the table, the result
> for that lookup is 0.).
>
I don't think that's a good reason. The out of range values can be made implementation defined, i.e. the mid-end shouldn't care about them. Not lowering this in gimple means we lose a heck of a lot of optimizations that are impossible to cover in RTL.
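For example (a sketch of mine, not taken from the PR), a tbl whose indices are all provably in range could simply become a VEC_PERM_EXPR at GIMPLE and be folded like any other permute:

---
#include <arm_neon.h>

/* Hedged illustration: every index is < 16, so TBL's out-of-range
   (zeroing) behaviour can never trigger and this is conceptually just
   VEC_PERM_EXPR <x, x, idx>, which GIMPLE can simplify or combine with
   surrounding permutes.  */
uint8x16_t
reverse_bytes (uint8x16_t x)
{
  const uint8x16_t idx = { 15, 14, 13, 12, 11, 10, 9, 8,
			   7, 6, 5, 4, 3, 2, 1, 0 };
  return vqtbl1q_u8 (x, idx);
}
---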
[Bug tree-optimization/114151] [14 Regression] weird and inefficient codegen and addressing modes since g:a0b1798042d033fd2cc2c806afbb77875dd2909b
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114151 --- Comment #3 from Tamar Christina ---
> This was a correctness fix btw, so I'm not sure we can easily recover - we
> could try using niter information for CHREC_VARIABLE but then there's
> variable niter here so I don't see a chance.
>
It looks like it's mostly SVE suffering here. Adv. SIMD and scalar codegen seem to have vastly improved with it; we see ~10% improvements due to better addressing for scalar. It also only happens at -O3 but -O2 is fine, which is weird as I expected IVopts to be enabled at -O2 too.
>
> OTOH the +1 could make it overflow for large size.
>
> Can you test the above? It should be an incremental improvement.
>
Applying the changes does not seem to make a difference for the final codegen :(
[Bug tree-optimization/114151] New: [14 Regression] weird and inefficient codegen and addressing modes since g:a0b1798042d033fd2cc2c806afbb77875dd2909b
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114151 Bug ID: 114151 Summary: [14 Regression] weird and inefficient codegen and addressing modes since g:a0b1798042d033fd2cc2c806afbb77875dd2909b Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: tnfchris at gcc dot gnu.org CC: rguenth at gcc dot gnu.org Target Milestone: --- Target: aarch64* Created attachment 57559 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57559&action=edit testcase

The attached C++ testcase, compiled with -O3 -mcpu=neoverse-n2, used to compile to a nice and simple loop. But after g:a0b1798042d033fd2cc2c806afbb77875dd2909b the codegen is weird and it uses horrible addressing modes.

The first odd part is that it's decided to split the loop: the "main" loop has a guard after it to branch to the exit if the iteration count is 1. If not, instead of just looping again it falls through to a copy of the main loop, but with destroyed addressing modes. The copy of the loop seems to have unshared the address calculations. Before we had:

_128 = (void *) ivtmp.11_20;
_54 = MEM <__SVFloat16_t> [(__fp16 *)_128];
_10 = MEM <__SVFloat16_t> [(__fp16 *)_128 + POLY_INT_CST [16B, 16B]];
_75 = MEM <__SVFloat16_t> [(__fp16 *)_128 + POLY_INT_CST [32B, 32B]];

etc, so all as an offset from _128. Now we have:

col_i_61 = (int) ivtmp.11_100;
_60 = (long unsigned int) col_i_61;
_59 = _60 * 2;
_58 = a_j_69 + _59;
_54 = MEM <__SVFloat16_t> [(__fp16 *)_58];
_53 = _59 + POLY_INT_CST [16, 16];
_13 = a_j_69 + _53;
_10 = MEM <__SVFloat16_t> [(__fp16 *)_13];
_74 = _59 + POLY_INT_CST [32, 32];
_19 = a_j_69 + _74;
_75 = MEM <__SVFloat16_t> [(__fp16 *)_19];

and similarly for the stores as well. It also weirdly creates some very complicated addressing computations. Before we had:

_144 = p_mat_16(D) + 6;
_64 = MEM <__SVFloat16_t> [(__fp16 *)_144 + ivtmp.10_100 * 2];
_143 = p_mat_16(D) + 4;
_84 = MEM <__SVFloat16_t> [(__fp16 *)_143 + ivtmp.10_100 * 2];

and after:

ivtmp.23_130 = (unsigned long) p_mat_16(D);
_123 = 2 - ivtmp.23_130;
_124 = <__SVFloat16_t> [(__fp16 *)0B + _123 + ivtmp.12_109 * 2];
_64 = MEM <__SVFloat16_t> [(__fp16 *)_124];
_122 = -ivtmp.23_130;
_120 = <__SVFloat16_t> [(__fp16 *)0B + _122 + ivtmp.12_109 * 2];
_84 = MEM <__SVFloat16_t> [(__fp16 *)_120];

This results in quite the codesize increase, and a 7-10% performance loss.
[Bug target/98877] [AArch64] Inefficient code generated for tbl NEON intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98877 --- Comment #9 from Tamar Christina --- While RA should be able to deal with this, shouldn't we also just lower TBLs in gimple? There's no reason why this can't be a VEC_PERM_EXPR, which would also get the copies removed at the gimple level and allow us to optimize this to something else depending on the index.
[Bug target/102171] vget_low_*/vget_high_* intrinsics should become BIT_FIELD_REF during gimple
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102171 --- Comment #3 from Tamar Christina --- (In reply to Andrew Pinski from comment #2) > I think I am going to implement this (or assign it interally to someone else > to implement). If you do, please also remove them from arm_neon.h and use the new intrinsics framework. We're gradually trying to get this file empty.
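A concrete (hedged, illustrative) example of the lowering being discussed:

---
#include <arm_neon.h>

/* Hedged sketch: vget_low_s32/vget_high_s32 simply select the low/high
   64 bits of a 128-bit vector, i.e. BIT_FIELD_REF <x, 64, 0> and
   BIT_FIELD_REF <x, 64, 64> at GIMPLE (little-endian lane numbering
   assumed), which would let later GIMPLE passes fold through them instead
   of treating them as opaque builtins.  */
int32x2_t
sum_halves (int32x4_t x)
{
  return vadd_s32 (vget_low_s32 (x), vget_high_s32 (x));
}
---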
[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441 Tamar Christina changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org --- Comment #29 from Tamar Christina --- (In reply to rguent...@suse.de from comment #28)
> On Mon, 26 Feb 2024, tnfchris at gcc dot gnu.org wrote:
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441
> >
> > --- Comment #27 from Tamar Christina ---
> > Created attachment 57538 [details]
> > --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57538&action=edit
> > proposed1.patch
> >
> > proposed patch, this gets the gathers and scatters back. doing regression
> > run.
>
> I don't think this will fly.
>
Well... I don't really know what to do here, I guess. Per the discussion on IRC, we only used to try gather/scatters when SCEV fails. Now that it succeeds we no longer try using the pattern, and instead try to handle it during vectorizable_load/vectorizable_store by recognizing the gathers/scatters inline through VMAT_GATHER_SCATTER. This works fine for normal gathers and scatters but doesn't work for widening gathers and narrowing scatters, which only the pattern seems to handle. I don't know how to get this to be detected through get_load_store_type since, well, that's very late. Among others, we've already determined the VF and the unpacks have already been marked relevant. So vectorizable_load/vectorizable_store would have to actively change the IL. So I don't know how widening and narrowing operations are supposed to work here. Given that, I will leave it up to the maintainers I guess.
[Bug tree-optimization/86530] Vectorization failure for a simple loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86530 --- Comment #8 from Tamar Christina --- (In reply to Andrew Pinski from comment #6)
> With my patch for V4QI, we still don't get the best code:
> vect_perm_even_271 = VEC_PERM_EXPR 4, 6 }>;
> vect_perm_even_273 = VEC_PERM_EXPR 4, 6 }>;
> vect_perm_even_275 = VEC_PERM_EXPR vect_perm_even_273, { 0, 2, 4, 6 }>;
>
> _275={_264[0], _264[2], _268[0], _268[2]} or
> VEC_PERM<_264, _268, {0, 2, 4, 6}>
>
> but for some reason we don't reduce it to that perm
>
> And there is still a lot of extra PERMS than there should be.
Because this loop is not something that can be fixed by using V4QI (we tried before). This loop requires improvements to SCEV and SLP. It's loading 16 sequential bytes as there's no gap between the p1 and p2 values across iterations, so this loop should be vectorized with V16QI and widening additions. So I don't think this is related to the other example. So I'll take it back as it requires actual vectorizer work and is part of things we're trying to address in GCC 15.
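A hedged sketch of the shape of loop I mean (illustrative only, not the PR's actual testcase):

---
/* Hedged illustration: the 16 bytes read through each pointer are
   consecutive, so the whole loop can be loaded as V16QI and accumulated
   via widening additions rather than permuting small V4QI chunks.  */
int
sum_diff_16 (const unsigned char *p1, const unsigned char *p2)
{
  int sum = 0;
  for (int i = 0; i < 16; i++)
    sum += p1[i] - p2[i];	/* char operands widened to int, then summed */
  return sum;
}
---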
[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441 --- Comment #27 from Tamar Christina --- Created attachment 57538 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57538&action=edit proposed1.patch Proposed patch; this gets the gathers and scatters back. Doing a regression run.
[Bug tree-optimization/114099] [14 regression] ICE in find_uses_to_rename_use when building darktable-4.6.1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114099 --- Comment #8 from Tamar Christina --- Created attachment 57537 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57537&action=edit uses.patch The new code seems sensitive to visitation order, as get_virtual_phi returns NULL for blocks which don't define or use a virtual PHI. Testing a patch.
[Bug middle-end/114081] [14 regression] ICE in verify_dominators when building php-8.3.3 (error: dominator of 16 should be 111, not 3) since r14-6822
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114081 Tamar Christina changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |tnfchris at gcc dot gnu.org --- Comment #6 from Tamar Christina --- slightly cleaned up testcase --- typedef struct filter_list_entry { const char *name; int id; void (*function)(); } filter_list_entry; static const filter_list_entry filter_list[9] = {0}; void php_zval_filter(int filter, int id1) { filter_list_entry filter_func; int size = 9; for (int i = 0; i < size; ++i) { if (filter_list[i].id == filter) { filter_func = filter_list[i]; goto done; } } #pragma GCC novector for (int i = 0; i < size; ++i) { if (filter_list[i].id == 0x0204) { filter_func = filter_list[i]; goto done; } } done: if (!filter_func.id) filter_func.function(); } --- seems to happen because it's doing an inner loop vect with two loops that have early exits to the same destination. if (multiple_exits_p) { update_loop = new_loop; doms = get_all_dominated_blocks (CDI_DOMINATORS, loop->header); for (unsigned i = 0; i < doms.length (); ++i) if (flow_bb_inside_loop_p (loop, doms[i])) doms.unordered_remove (i); } was supposed to update the dominators but something goes wrong. Mine, but for monday..
[Bug tree-optimization/114068] [14 regression] ICE when building darktable-4.6.1 (error: PHI node with wrong VUSE on edge from BB 25) since r14-8768
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114068 --- Comment #14 from Tamar Christina --- patch submitted https://gcc.gnu.org/pipermail/gcc-patches/2024-February/646415.html
[Bug tree-optimization/114068] [14 regression] ICE when building darktable-4.6.1 (error: PHI node with wrong VUSE on edge from BB 25) since r14-8768
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114068 --- Comment #13 from Tamar Christina --- Created attachment 57510 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57510&action=edit candidate-patch1.patch Candidate patch being tested. I was hoping to correct it during peeling itself when the merge block is created, but I don't have enough information there to do so.
[Bug tree-optimization/114068] [14 regression] ICE when building darktable-4.6.1 (error: PHI node with wrong VUSE on edge from BB 25) since r14-8768
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114068 --- Comment #12 from Tamar Christina --- looks like the moving of the store didn't update a stray out of block use of the MEM. working on patch.
[Bug tree-optimization/114068] [14 regression] ICE when building darktable-4.6.1 (error: PHI node with wrong VUSE on edge from BB 25) since r14-8768
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114068 Tamar Christina changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |tnfchris at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #11 from Tamar Christina --- reduced to --- struct h { int b; int f; } k; void n(int m) { struct h a = k; for (int o = m; o; ++o) { if (a.f) __builtin_unreachable(); if (o > 1) __builtin_unreachable(); *( + o) = 1; } } --- which fails on aarch64 too
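For reference, a hedged guess at a complete version of the reducer (the store target was lost in the text above; &a.b is my assumption for illustration only, not necessarily what the original used):

---
/* Hedged reconstruction of the reducer; the store target on the last
   statement is assumed, since the original line was truncated.  */
struct h
{
  int b;
  int f;
} k;

void
n (int m)
{
  struct h a = k;
  for (int o = m; o; ++o)
    {
      if (a.f)
	__builtin_unreachable ();
      if (o > 1)
	__builtin_unreachable ();
      *(&a.b + o) = 1;	/* assumed target of the truncated store */
    }
}
---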
[Bug target/114063] New: Use IFN_CHECK_RAW_PTRS/IFN_CHECK_WAR_PTRS for Advanced. SIMD
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114063 Bug ID: 114063 Summary: Use IFN_CHECK_RAW_PTRS/IFN_CHECK_WAR_PTRS for Advanced. SIMD Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: tnfchris at gcc dot gnu.org Target Milestone: --- Target: aarch64*

The following example:

void fn (short *a, short *b, short *c, int n)
{
  for (int i = 0; i < n; i += 2)
    {
      short b0 = b[i + 0];
      short b1 = b[i + 1];
      a[i + 0] = b0 + 2;
      a[i + 1] = b1 + 3;
    }
}

when compiled with -O3 -march=armv9-a will generate an alias check using whilewr, which is the corresponding instruction for doing alias checks using CHECK_WAR_PTRS. But when compiling for a known CPU, such as -mcpu=neoverse-v2, the vectorizer chooses to use Adv. SIMD for the main loop, and so the check for supporting IFN_CHECK_WAR_PTRS fails as we only support SVE modes. Since all that really matters is the element size, I believe we can still use IFN_CHECK_WAR_PTRS for Adv. SIMD using the SVE instruction if SVE is allowed. That is, we should implement the pattern for Advanced SIMD modes as well, gated on TARGET_SVE.
[Bug tree-optimization/114061] GCC fails vectorization when using __builtin_prefetch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114061 --- Comment #4 from Tamar Christina --- (In reply to Andrew Pinski from comment #3)
> Confirmed.
>
> Though maybe we should drop them in the vectorized version of the loop. HW
> prefetchers usually do a decent job and sometimes (maybe most) SW hinted
> prefetches interfere with the HW prefetchers.
Definitely agree; I'm not sure how useful they are either, but some customers definitely seem to want them.
[Bug tree-optimization/114061] GCC fails vectorization when using __builtin_prefetch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114061 --- Comment #2 from Tamar Christina --- (In reply to Andrew Pinski from comment #1) > I thought there was already one recorded about this. I could only find https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103938 about an ICE when prefetching a vector address.
[Bug tree-optimization/114061] New: GCC fails vectorization when using __builtin_prefetch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114061 Bug ID: 114061 Summary: GCC fails vectorization when using __builtin_prefetch Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: tnfchris at gcc dot gnu.org Target Milestone: --- The following example: void foo(double * restrict a, double * restrict b, int n){ int i; for(i=0; i
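The testcase got truncated above; a hedged guess at its general shape (not necessarily the exact reproducer from this report) is:

---
/* Hedged reconstruction, assuming a simple streaming loop with a software
   prefetch of a later iteration's input; the exact loop body and prefetch
   distance in the original report may differ.  */
void
foo (double *__restrict a, double *__restrict b, int n)
{
  for (int i = 0; i < n; i++)
    {
      __builtin_prefetch (&b[i + 32]);	/* hint only, no semantic effect */
      a[i] = b[i] + 1.0;
    }
}
---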
[Bug target/113257] -march=native or -mcpu=native are ineffective, but -march=native -mcpu=native works on arm64 M2 Ultra
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113257 --- Comment #5 from Tamar Christina --- (In reply to Sam James from comment #3) > (In reply to Richard Earnshaw from comment #2) > I'm missing why the combination then works though? So we've made several changes here over time. -mcpu=native does attempt to find the core id to know what it is, but since GCC-12 we also enable extensions that don't have a requirement on an architecture. So for instance on an SVE capable core we would enable it, but we of course can't enable a tuning model. But between GCC 13 and GCC 14 many things have changes. It looks like GCC-14 works as expected: > ./install/bin/gcc -xc -S -o - - -march=native < /dev/null .arch armv8-a+flagm+dotprod+rdma+lse+crc+aes+sha3+fp16fml+rcpc+i8mm+bf16+sb+ssbs+pauth .file "" .text .ident "GCC: (GNU) 14.0.1 20240129 (experimental)" .section .note.GNU-stack,"",@progbits > ./install/bin/gcc -xc -S -o - - -mcpu=native < /dev/null .arch armv8-a+flagm+dotprod+rdma+lse+crc+aes+sha3+fp16fml+rcpc+i8mm+bf16+sb+ssbs+pauth .file "" .text .ident "GCC: (GNU) 14.0.1 20240129 (experimental)" .section .note.GNU-stack,"",@progbits > ./install/bin/gcc -xc -S -o - - -mcpu=native -march=native < /dev/null .arch armv8-a+flagm+dotprod+rdma+lse+crc+aes+sha3+fp16fml+rcpc+i8mm+bf16+sb+ssbs+pauth .file "" .text .ident "GCC: (GNU) 14.0.1 20240129 (experimental)" .section .note.GNU-stack,"",@progbits So first question is, can you confirm it does for GCC-14 for you too? In the meantime I'll try GCC-13
[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441 Tamar Christina changed: What|Removed |Added Ever confirmed|0 |1 Summary|[14 Regression] Fail to |[14 Regression] Fail to |fold the last element with |fold the last element with |multiple loop |multiple loop since ||g:2efe3a7de0107618397264017 ||fb045f237764cc7 Last reconfirmed||2024-02-22 Status|UNCONFIRMED |NEW Keywords|needs-bisection | --- Comment #26 from Tamar Christina --- (In reply to Richard Biener from comment #18) > (In reply to Tamar Christina from comment #17) > > Ok, bisected to > > > > g:2efe3a7de0107618397264017fb045f237764cc7 is the first bad commit > > commit 2efe3a7de0107618397264017fb045f237764cc7 > > Author: Hao Liu > > Date: Wed Dec 6 14:52:19 2023 +0800 > > > > tree-optimization/112774: extend the SCEV CHREC tree with a nonwrapping > > flag > > > > Before this commit we were unable to analyse the stride of the access. > > After this niters seems to estimate the loop trip count at 4 and after that > > the logs diverge enormously. > > Hum, but that's backward and would match to what I said in comment#2 - we > should get better code with that. > Ok, so I've dug more into this today. It's definitely this commit that's causing it. The reason is we no longer consider masked gather/scatters. Before this commit the gather pattern would trigger: tresg.i:3:275: note: gather/scatter pattern: detected: a[_2] = b.3_3; tresg.i:3:275: note: gather_scatter pattern recognized: .SCATTER_STORE ((sizetype) , _2, 4, b.3_3); and the use of the masked scatter is what's causing the epilogue to not be required and why it generates better code. It's not the loads. The issue is that vect_analyze_data_refs only considers gather/scatters IF DR analysis fails, which it did before: tresg.c:31:29: missed: failed: evolution of offset is not affine. base_address: offset from base address: constant offset from base address: step: base alignment: 0 base misalignment: 0 offset alignment: 0 step alignment: 0 base_object: array1 Access function 0: {{m_112 * 2, +, 24}_3, +, 2}_4 Access function 1: 0 Creating dr for array1[0][_8] This now succeeds after the quoted commit: success. base_address: offset from base address: (ssizetype) ((sizetype) (m_111 * 2) * 2) constant offset from base address: 0 step: 4 base alignment: 8 base misalignment: 0 offset alignment: 4 step alignment: 4 base_object: array1 Access function 0: {{m_112 * 2, +, 24}_3, +, 2}_4 Access function 1: 0 Creating dr for array1[0][_8] so we never enter: /* Check that analysis of the data-ref succeeded. 
*/ if (!DR_BASE_ADDRESS (dr) || !DR_OFFSET (dr) || !DR_INIT (dr) || !DR_STEP (dr)) { and without the IFN scatters it tries deinterleaving scalar stores to scatters: tresg.c:29:22: note: Detected single element interleaving array1[0][_8] step 4 tresg.c:29:22: note: Detected single element interleaving array1[1][_8] step 4 tresg.c:29:22: note: Detected single element interleaving array1[2][_8] step 4 tresg.c:29:22: note: Detected single element interleaving array1[3][_8] step 4 tresg.c:29:22: note: Detected single element interleaving array1[0][_1] step 4 tresg.c:29:22: note: Detected single element interleaving array1[1][_1] step 4 tresg.c:29:22: note: Detected single element interleaving array1[2][_1] step 4 tresg.c:29:22: note: Detected single element interleaving array1[3][_1] step 4 tresg.c:29:22: missed: not consecutive access array2[_4][_8] = _70; tresg.c:29:22: note: using strided accesses tresg.c:29:22: missed: not consecutive access array2[_4][_1] = _68; tresg.c:29:22: note: using strided accesses ... tresg.c:29:22: note: using gather/scatter for strided/grouped access, scale = 2 but without the SCATTER_STORE IFN it never tries masking the scatter, so we lose MASK_SCATTER_STORE and hence we generate worse code because the whole loop can no longer be predicated. However, trying to force it generates an ICE so I guess it's not that simple.
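To make the predication point concrete, here is a hedged illustration (mine, not the PR's testcase) of the kind of store involved:

---
/* Hedged sketch: a strided store becomes a scatter.  With
   MASK_SCATTER_STORE the whole SVE loop can be predicated and needs no
   scalar epilogue; with only an unmasked scatter (or scalar strided
   stores) it cannot, which is the codegen difference described above.  */
void
clear_strided (int *__restrict out, int stride, int n)
{
  for (int i = 0; i < n; i++)
    out[(long) i * stride] = 0;
}
---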
[Bug target/113295] [14 Regression] SPEC 2006 416.gamess miscompares on Aarch64 when built with -Ofast -mcpu=native since g:2f46e3578d45ff060a0a329cb39d4f52878f9d5a
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113295 Tamar Christina changed: What|Removed |Added Keywords|needs-bisection | Summary|[14 Regression] SPEC 2006 |[14 Regression] SPEC 2006 |416.gamess miscompares on |416.gamess miscompares on |Aarch64 when built with |Aarch64 when built with |-Ofast -march=native -flto |-Ofast -mcpu=native since ||g:2f46e3578d45ff060a0a329cb ||39d4f52878f9d5a --- Comment #4 from Tamar Christina --- Most of the changes seem to be scheduling and register renaming, so I guess one should compile with scheduling off to reduce the noise a bit. detinp_ does have a weird transformation where a cluster of csels goes missing and is replaced with movs, but the function is too big for me to quickly tell if this is the issue.
[Bug target/113295] [14 Regression] SPEC 2006 416.gamess miscompares on Aarch64 when built with -Ofast -march=native -flto
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113295 --- Comment #3 from Tamar Christina --- I'm however able to reproduce it at -Ofast alone, no need for `-flto`
[Bug target/113295] [14 Regression] SPEC 2006 416.gamess miscompares on Aarch64 when built with -Ofast -march=native -flto
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113295 --- Comment #2 from Tamar Christina --- bisected to commit g:2f46e3578d45ff060a0a329cb39d4f52878f9d5a Author: Richard Sandiford Date: Thu Dec 14 13:46:16 2023 + aarch64: Improve handling of accumulators in early-ra Being very simplistic, early-ra just models an allocno's live range as a single interval. This doesn't work well for single-register accumulators that are updated multiple times in a loop, since in and it still seems to be miscomparing today.
[Bug target/113295] [14 Regression] SPEC 2006 416.gamess miscompares on Aarch64 when built with -Ofast -march=native -flto
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113295 Tamar Christina changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2024-02-19 Ever confirmed|0 |1 Priority|P3 |P1 CC||tnfchris at gcc dot gnu.org --- Comment #1 from Tamar Christina --- Ah, I missed this. Yeah, we've seen it but didn't have time to track it down until now. However, our CI shows it showed up between g:ae034b9106fbdd855ec22ce221bb61a1a9a532c3 and g:a064e4925afa1ad5f2f8c1350c4f57d631ce, and g:34d339bbd0c1f5b4ad9587e7ae8387c912cb028b is a target-only change so unlikely to be related. Bisecting.
[Bug middle-end/111156] [14 Regression] aarch64 aarch64/sve/mask_struct_store_4.c failures
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56 --- Comment #21 from Tamar Christina --- (In reply to Richard Biener from comment #18) > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc > index 7cf9504398c..8deeecfd4aa 100644 > --- a/gcc/tree-vect-slp.cc > +++ b/gcc/tree-vect-slp.cc > @@ -1280,8 +1280,11 @@ vect_build_slp_tree_1 (vec_info *vinfo, unsigned char > *swap, > && rhs_code.is_tree_code () > && (TREE_CODE_CLASS (tree_code (first_stmt_code)) > == tcc_comparison) > - && (swap_tree_comparison (tree_code (first_stmt_code)) > - == tree_code (rhs_code))) > + && ((swap_tree_comparison (tree_code (first_stmt_code)) > +== tree_code (rhs_code)) > + || ((TREE_CODE_CLASS (tree_code (alt_stmt_code)) > +== tcc_comparison) > + && rhs_code == alt_stmt_code))) >&& !(STMT_VINFO_GROUPED_ACCESS (stmt_info) > && (first_stmt_code == ARRAY_REF > || first_stmt_code == BIT_FIELD_REF > > should get you SLP but: > > t.c:8:26: note: === vect_slp_analyze_operations === > t.c:8:26: note: ==> examining statement: pretmp_29 = *_28; > t.c:8:26: missed: unsupported load permutation > t.c:10:30: missed: not vectorized: relevant stmt not supported: pretmp_29 > = *_28; > > t.c:8:26: note: op template: pretmp_29 = *_28; > t.c:8:26: note: stmt 0 pretmp_29 = *_28; > t.c:8:26: note: stmt 1 pretmp_29 = *_28; > t.c:8:26: note: load permutation { 0 0 } hmm with that applied I get: sve-mis.c:8:26: note: ==> examining statement: pretmp_29 = *_28; sve-mis.c:8:26: note: Vectorizing an unaligned access. sve-mis.c:8:26: note: vect_model_load_cost: unaligned supported by hardware. sve-mis.c:8:26: note: vect_model_load_cost: inside_cost = 1, prologue_cost = 0 . but it bails out at: sve-mis.c:8:26: missed: Not using elementwise accesses due to variable vectorization factor. sve-mis.c:10:25: missed: not vectorized: relevant stmt not supported: .MASK_STORE (_5, 8B, _27, pretmp_29); sve-mis.c:8:26: missed: bad operation or unsupported loop bound. for me
[Bug middle-end/111156] [14 Regression] aarch64 aarch64/sve/mask_struct_store_4.c failures
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56 --- Comment #15 from Tamar Christina --- and just -O3 -march=armv8-a+sve
[Bug middle-end/111156] [14 Regression] aarch64 aarch64/sve/mask_struct_store_4.c failures
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56 --- Comment #14 from Tamar Christina --- (In reply to Richard Biener from comment #13) > I didn't add STMT_VINFO_SLP_VECT_ONLY, I'm quite sure we can now do both SLP > of masked loads and stores, so yes, STMT_VINFO_SLP_VECT_ONLY (when we formed > a DR group of stmts we cannot combine without SLP as the masks are not equal) > should be set for both loads and stores. > > The can_group_stmts_p checks as present seem correct here (but the dump > should not say "Load" but maybe "Access") I guess I'm wondering because of this usage: /* Check that the data-refs have same first location (except init) and they are both either store or load (not load and store, not masked loads or stores). */ if (DR_IS_READ (dra) != DR_IS_READ (drb) || data_ref_compare_tree (DR_BASE_ADDRESS (dra), DR_BASE_ADDRESS (drb)) != 0 || data_ref_compare_tree (DR_OFFSET (dra), DR_OFFSET (drb)) != 0 || !can_group_stmts_p (stmtinfo_a, stmtinfo_b, true)) break; We don't exit there now for non-SLP. > > So what's the testcase comment#9 talks about? You should be able to reproduce it with: --- typedef __SIZE_TYPE__ size_t; typedef signed char int8_t; typedef unsigned short uint16_t ; void __attribute__((noinline, noclone)) test_i8_i8_i16_2(int8_t *__restrict dest, int8_t *__restrict src, uint16_t *__restrict cond, size_t n) { for (size_t i = 0; i < n; ++i) { if (cond[i] < 8) dest[i * 2] = src[i]; if (cond[i] > 2) dest[i * 2 + 1] = src[i]; } } void __attribute__((noinline, noclone)) test_i8_i8_i16_2_1(volatile int8_t * dest, volatile int8_t * src, volatile uint16_t * cond, size_t n) { #pragma GCC novector for (size_t i = 0; i < n; ++i) { if (cond[i] < 8) dest[i * 2] = src[i]; if (cond[i] > 2) dest[i * 2 + 1] = src[i]; } } #define size 16 int8_t srcarray[size]; uint16_t maskarray[size]; int8_t destarray[size*2]; int8_t destarray1[size*2]; int main() { #pragma GCC novector for(int i = 0; i < size; i++) { maskarray[i] = i == 10 ? 0 : (i == 5 ? 9 : (2*i) & 0xff); srcarray[i] = i; } #pragma GCC novector for(int i = 0; i < size*2; i++) { destarray[i] = i; destarray1[i] = i; } test_i8_i8_i16_2(destarray, srcarray, maskarray, size); test_i8_i8_i16_2_1(destarray1, srcarray, maskarray, size); #pragma GCC novector for(int i = 0; i < size*2; i++) { if (destarray[i] != destarray1[i]) __builtin_abort(); } } --- since really only one of the functions needs to vectorize.
[Bug tree-optimization/112376] [14 Regression] gcc.dg/tree-ssa/ssa-dom-thread-7.c missed threading in aarch64 case
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112376 Tamar Christina changed: What|Removed |Added Last reconfirmed||2024-02-15 Summary|[14 Regression] |[14 Regression] |gcc.dg/tree-ssa/ssa-dom-thr |gcc.dg/tree-ssa/ssa-dom-thr |ead-7.c was not adjusted|ead-7.c missed threading in |for aarch64 case|aarch64 case Ever confirmed|0 |1 Keywords||missed-optimization Component|testsuite |tree-optimization Status|UNCONFIRMED |NEW CC||amacleod at redhat dot com, ||tnfchris at gcc dot gnu.org --- Comment #1 from Tamar Christina --- This is a jumpthreading regression caused by: commit g:0cfc9c953d0221ec3971a25e6509ebe1041f142e Author: Andrew MacLeod Date: Thu Aug 17 12:34:59 2023 -0400 Phi analyzer - Initialize with range instead of a tree. Rangers PHI analyzer currently only allows a single initializer to a group. This patch changes that to use an inialization range, which is cumulative of all integer constants, plus a single symbolic value. There is no other change to group functionality. This patch also changes the way PHI groups are printed so they show up in the listing as they are encountered, rather than as a list at the end. It was more difficult to see what was going on previously. We seem to miss one thread that we did previously in the testcase. The new code does look worse. Jumpthreading is outside my wheelhouse for now, so any Ideas Andrew?
[Bug middle-end/111156] [14 Regression] aarch64 aarch64/sve/mask_struct_store_4.c failures
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111156 Tamar Christina changed: What|Removed |Added CC||rguenth at gcc dot gnu.org --- Comment #12 from Tamar Christina --- The commit that caused it is:

commit g:a1558e9ad856938f165f838733955b331ebbec09
Author: Richard Biener
Date: Wed Aug 23 14:28:26 2023 +0200

    tree-optimization/15 - SLP of masked stores

    The following adds the capability to do SLP on .MASK_STORE, I do not
    plan to add interleaving support.

specifically this change:

diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
index 3e9a284666c..a2caf6cb1c7 100644
--- a/gcc/tree-vect-data-refs.cc
+++ b/gcc/tree-vect-data-refs.cc
@@ -3048,8 +3048,7 @@ can_group_stmts_p (stmt_vec_info stmt1_info, stmt_vec_info stmt2_info,
 	 like those created by build_mask_conversion.  */
       tree mask1 = gimple_call_arg (call1, 2);
       tree mask2 = gimple_call_arg (call2, 2);
-      if (!operand_equal_p (mask1, mask2, 0)
-	  && (ifn == IFN_MASK_STORE || !allow_slp_p))
+      if (!operand_equal_p (mask1, mask2, 0) && !allow_slp_p)
	{
	  mask1 = strip_conversion (mask1);
	  if (!mask1)

With the change it now incorrectly thinks that the two masks (a <= 7, a > 2) are the same, which is why one of the masks goes missing. Part of it is that the boolean is used in a weird way. During vect_analyze_data_ref_accesses, where this difference is important, we pass true in the initial check, but the || before made it so that we checked the MASK_STOREs still. Now it means during analysis we never check. Later on in the same method we check it again but with false as the argument for determining STMT_VINFO_SLP_VECT_ONLY. The debug statement there is weird btw, as it says:

if (dump_enabled_p () && STMT_VINFO_SLP_VECT_ONLY (stmtinfo_a))
  dump_printf_loc (MSG_NOTE, vect_location,
		   "Load suitable for SLP vectorization only.\n");

but as far as I can see, stmtinfo_a can be a store too, based on the checks for DR_IS_READ (dra) just slightly higher up. The patch that added this check (g:997636716c5dde7d59d026726a6f58918069f122) says it's because the vectorizer doesn't support SLP of masked loads, and I can't tell if we do now. If we do, the boolean should be dropped; if we don't, we probably need the check back to allow the check for stores. It looks like this check is being used to disable STMT_VINFO_SLP_VECT_ONLY for loads, which is a bit counter-intuitive and feels like a hack rather than just doing:

   STMT_VINFO_SLP_VECT_ONLY (stmtinfo_a)
-    = !can_group_stmts_p (stmtinfo_a, stmtinfo_b, false);
+    = !can_group_stmts_p (stmtinfo_a, stmtinfo_b)
+      && DR_IS_WRITE (dra);

So I think the boolean should be dropped and just reject loads for STMT_VINFO_SLP_VECT_ONLY... This also seems to give much better codegen... In any case, richi?
[Bug fortran/107071] gfortran.dg/ieee/modes_1.f90 fails on aarch64-linux
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107071 Tamar Christina changed: What|Removed |Added CC||tnfchris at gcc dot gnu.org --- Comment #12 from Tamar Christina --- Indeed, should we just xfail it on AArch64?
[Bug rtl-optimization/113903] sched1 should schedule across EBBS
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113903 --- Comment #2 from Tamar Christina --- (In reply to Alexander Monakov from comment #1)
> Lifting those insns from the L8 BB to the L10 BB requires duplicating them
> on all incoming edges targeting L8, doesn't it?
>
No, because they're unused before L10. If they are used then they can't be moved. (Note that L10 is only reachable from L8 as it's a branch in the loop.)
> Why is decreasing live ranges important here?
Two reasons: first, we have to avoid prematurely creating the copies. The loop has multiple exits, and the values are not relevant for all exits. The

mov z29.d, z31.d
mov z27.d, z30.d

is being done because we increment the inductions in the same basic block. But the incremented value is not needed in L8. For loop induction variables I suppose we can change the materialization point in the vectorizer to deal with them that way, but that only takes care of inductions, and ideally we shouldn't perform operations before an exit if they're not needed for that exit. At the moment the vectorizer only deals with moving statements that are needed for correctness.
[Bug tree-optimization/113734] [14 regression] libarchive miscompiled (fails libarchive_test_read_format_rar5_extra_field_version test) since r14-8768-g85094e2aa6dba7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113734 Tamar Christina changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #28 from Tamar Christina --- Fixed, thanks for the report and all the help reducing it!
[Bug rtl-optimization/113903] New: sched1 should schedule across EBBS
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113903 Bug ID: 113903 Summary: sched1 should schedule across EBBS Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: tnfchris at gcc dot gnu.org Target Milestone: ---

The following testcase:

#define N 306
#define NEEDLE 136

int table[N];

int foo (int i, unsigned short parse_tables_n)
{
  parse_tables_n >>= 9;
  parse_tables_n += 11;
  while (i < N && parse_tables_n--)
    table[i++] = 0;

  return table[NEEDLE];
}

compiled at -O3 shows an issue we've started getting with the support for early break vectorization. sched1 doesn't seem to be able to schedule across EBBs, which is logical since we never really needed to before. However the above code generates:

.L10:
	st1w	z28.s, p7, [x1, #1, mul vl]
	st1w	z28.s, p7, [x1]
	add	x1, x1, x5
	cmp	w0, w2
	bcc	.L17
.L8:
	cmpne	p15.h, p7/z, z31.h, #0
	mov	z29.d, z31.d
	not	p15.b, p14/z, p15.b
	mov	z27.d, z30.d
	add	w2, w2, w4
	dech	z31.h
	ptest	p14, p15.b
	incw	z30.s, all, mul #2
	b.none	.L10
	umov	w1, v29.h[0]
	umov	w20, v27.s[0]
	and	w3, w1, 65535
	b	.L6

and, the AArch64 codegen inefficiencies aside (which I will tackle myself), shows that we're copying the old value of the induction variables in every loop iteration to keep them for the reductions if we exit. However the new values are not live in L8 and so the operations can be moved to L10:

.L10:
	incw	z30.s, all, mul #2
	dech	z31.h
	st1w	z28.s, p7, [x1, #1, mul vl]
	st1w	z28.s, p7, [x1]
	add	x1, x1, x5
	cmp	w0, w2
	bcc	.L17
.L8:
	cmpne	p15.h, p7/z, z31.h, #0
	not	p15.b, p14/z, p15.b
	add	w2, w2, w4
	ptest	p14, p15.b
	b.none	.L10
	umov	w1, v31.h[0]
	umov	w20, v30.s[0]
	and	w3, w1, 65535
	b	.L6

and thus decreasing the live ranges. The optimal codegen for this sequence is:

.L10:
	dech	z31.h
	incw	z30.s, all, mul #2
	st1w	z28.s, p7, [x1, #1, mul vl]
	st1w	z28.s, p7, [x1]
	add	x1, x1, x5
	cmp	w0, w2
	bcc	.L17
.L8:
	cmpeq	p15.h, p7/z, z31.h, #0
	add	w2, w2, w4
	b.none	.L10
	umov	w1, v31.h[0]
	umov	w20, v30.s[0]
	b	.L6
[Bug tree-optimization/113734] [14 regression] libarchive miscompiled (fails libarchive_test_read_format_rar5_extra_field_version test) since r14-8768-g85094e2aa6dba7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113734 Tamar Christina changed: What|Removed |Added Component|middle-end |tree-optimization Priority|P3 |P1
[Bug middle-end/113734] [14 regression] libarchive miscompiled (fails libarchive_test_read_format_rar5_extra_field_version test) since r14-8768-g85094e2aa6dba7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113734 --- Comment #24 from Tamar Christina --- The case I thought would go wrong with the above fix is:

---
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N 306
#define NEEDLE 135

__attribute__ ((noipa, noinline))
int use(int x[N])
{
  printf("res=%d\n", x[NEEDLE]);
  return x[NEEDLE];
}

__attribute__ ((noipa, noinline))
int foo (int i, unsigned short parse_tables_n)
{
  int table[N];
  memset (table, -1, sizeof (table));
  parse_tables_n >>= 9;
  parse_tables_n += 9;
  while (i < N && parse_tables_n--)
    table[i++] = 0;

  return use (table);
}

int main ()
{
  if (foo (0, 0xffff) != 0)
    abort ();
  return 0;
}
---

but this seems fine because of the bias_for_lowest which I now understand to be there to account for this. So starting a regtest for that patch.
[Bug middle-end/113734] [14 regression] libarchive miscompiled (fails libarchive_test_read_format_rar5_extra_field_version test) since r14-8768-g85094e2aa6dba7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113734 --- Comment #23 from Tamar Christina --- Small standalone reducer:

---
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N 306
#define NEEDLE 136

__attribute__ ((noipa, noinline))
int use(int x[N])
{
  printf("res=%d\n", x[NEEDLE]);
  return x[NEEDLE];
}

__attribute__ ((noipa, noinline))
int foo (int i, unsigned short parse_tables_n)
{
  int table[N];
  memset (table, -1, sizeof (table));
  parse_tables_n >>= 9;
  parse_tables_n += 11;
  while (i < N && parse_tables_n--)
    table[i++] = 0;

  return use (table);
}

int main ()
{
  if (foo (0, 0xffff) != 0)
    abort ();
  return 0;
}
---

> gcc incorrect.c -O3 -o incorrect.exe; and ./incorrect.exe; echo $status
res=-1
1
> gcc incorrect.c -O1 -o incorrect.exe; and ./incorrect.exe; echo $status
res=0
0
> gcc incorrect.c -O3 -fdisable-tree-cunroll -o incorrect.exe; and ./incorrect.exe; echo $status
res=0
0
[Bug middle-end/113734] [14 regression] libarchive miscompiled (fails libarchive_test_read_format_rar5_extra_field_version test) since r14-8768-g85094e2aa6dba7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113734 --- Comment #22 from Tamar Christina --- (In reply to Richard Biener from comment #21)
> loop->nb_iterations_upper_bound exactly is an upper bound on the number of
> latch executions, so maybe I'm missing the point here. When we update it it
> as well has to reflect an upper bound on that, whether the last exit (the
> one before the latch) is the IV exit or a vectorized early exit.
Yes, but the issue here is that the bounds limit is coming from an exit other than the one being chosen as the IV exit. So the latch iterates X times not X - 1.
>
> But yes, if the last exit is an early one that last iteration might be
> partial (so we drop the -1), but that's what we already do?
I don't see where. This code all seems to remove -1 from the iteration count and assumes the latch iteration count + 1 == nbounds. I don't really think this is about partial loops at all. Change

parse_tables_n >>= 9;
parse_tables_n += 11;

to

parse_tables_n >>= 9;
parse_tables_n += 10;

and then loop->nb_iterations_upper_bound is 136 and there are no partial iterations here. You do 17 full vector iterations with no residuals.
[Bug middle-end/113734] [14 regression] libarchive miscompiled (fails libarchive_test_read_format_rar5_extra_field_version test) since r14-8768-g85094e2aa6dba7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113734 --- Comment #20 from Tamar Christina ---

[local count: 21718864]:
...
_54 = (short unsigned int) bits_106;
_26 = _54 >> 9;
_88 = _139 + 7;
_89 = _88 & 7;
_111 = _26 + 10;

[local count: 181308616]:
# i_66 = PHI
# parse_tables_n_lsm.141_87 = PHI <_29(36), _111(21)>
i_69 = i_66 + 1;
table[i_66] = 0;
_29 = parse_tables_n_lsm.141_87 + 65535;
if (parse_tables_n_lsm.141_87 != 0)
  goto ; [94.50%]
else
  goto ; [5.50%]

[local count: 171336643]:
if (i_69 != 306)
  goto ; [93.84%]
else
  goto ; [6.16%]

[local count: 160784290]:
goto ; [100.00%]

is the relevant part of the loop. At the start of vect we determine:

misc.c:147:31: note: All loop exits successfully analyzed.
misc.c:147:31: note: Symbolic number of iterations is 306 - (unsigned int) i_146
Creating dr for table[i_66]

but when we get to vect_transform_loop we've updated loop->nb_iterations_upper_bound to be 137.

>>> p (int)loop->nb_iterations_upper_bound
$2 = 137

SCEV claims:

result: # of iterations _26 + 10, bounded by 137
(instantiate_scev (instantiate_below = 21 -> 22) (evolution_loop = 4) (chrec = _26 + 10) (res = _26 + 10))
Statement (exit)if (parse_tables_n_lsm.141_87 != 0) is executed at most _26 + 10 (bounded by 137) + 1 times in loop 4.

and further down

Statement table[i_66] = 0; is executed at most 305 (bounded by 305) + 1 times in loop 4.

So it looks like we determine the latch will execute a maximum of 305 times, but that the early exit is only executed 137 times. Which means the loop is bounded by 137 iterations. This looks like it's because _111 is unsigned short, so (0xffff >> 9) + 10 == 137.

This is fine, but during vect_transform_loop when we update the bounds we do:

wi::udiv_floor (loop->nb_iterations_upper_bound + bias_for_lowest, lowest_vf) - 1);

so we end up doing ((137 + 1) / 8) - 1, which is bogus. This results in 16 iterations while we know the vector loop must do 17. The additional -1 is because the code assumes that the niters bound is coming from the latch iteration count; however, in this case it's coming from another exit and the latch never executes. I think for early exits we should take the ceil instead, since we have cases like this where an early exit restricts the loop's iteration count, but is not chosen as the main exit.

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 854e9d78bc7..0b1656fef2f 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -12171,7 +12171,8 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
   /* True if the final iteration might not handle a full vector's
      worth of scalar iterations.  */
   bool final_iter_may_be_partial
-    = LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo);
+    = LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
+      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo);
   /* The minimum number of iterations performed by the epilogue.  This
      is 1 when peeling for gaps because we always need a final scalar
      iteration.  */

Fixes it, but this doesn't feel entirely right to me. In particular it would still have been wrong if the nbounds on (parse_tables_n_lsm.141_87 != 0) was 136: since there's no fractional part after the divide there, we'd end up at 16 again. So I think we need to know whether the bound is coming from the loop exit or an alternate exit and not subtract 1 if the bound does not come from the latch.
[Bug middle-end/113734] [14 regression] libarchive miscompiled (fails libarchive_test_read_format_rar5_extra_field_version test) since r14-8768-g85094e2aa6dba7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113734 --- Comment #19 from Tamar Christina --- Ok, removing all the noise shows that this is the same issue as I saw before. The code out of the vectorizer is correct, but cunroll does a dodgy unrolling. -fdisable-tree-cunroll confirms it's the unrolling. cunroll claims:

Loop 4 iterates at most 16 times.
Loop 4 likely iterates at most 16 times.
Analyzing # of iterations of loop 4 exit condition [1, + , 1](no_overflow) < bnd.157_285
...
;; Guessed iterations of loop 4 is 7.347979. New upper bound 16.
Making edge 23->36 impossible by redistributing probability to other edges. Original probability: 93.8% (guessed)
misc.c:147:31: optimized: loop with 16 iterations completely unrolled (header execution count 2859449)
Last iteration exit edge was proved true.
Not peeling: number of iterations is not estimated

but i is bounded by i < 306 and VF=8, so bounding the loop iterations to 16 means it just cut off half the loop. So it looks like the upper bound is wrong. Checking what we write out during loop vect.
[Bug middle-end/113734] [14 regression] libarchive miscompiled (fails libarchive_test_read_format_rar5_extra_field_version test) since r14-8768-g85094e2aa6dba7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113734 --- Comment #18 from Tamar Christina --- Loop that gets miscompiled is the initialization loop: while (parse_tables_n-- && i < 306) table[i++] = 0; and indeed, the compiler seems to also be ignoring the #pragma GCC novector attribute on it. Will take a look at both.
[Bug middle-end/113734] [14 regression] libarchive miscompiled (fails libarchive_test_read_format_rar5_extra_field_version test) since r14-8768-g85094e2aa6dba7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113734 --- Comment #17 from Tamar Christina --- (In reply to Sam James from comment #16) > Created attachment 57393 [details] > test.c > > OK, all done now (I figured I'd let cvise finish). No more :) > > By the way, this fails on arm64 too (at least the reduced thing). Thanks! yeah I'm already debugging there :) I have a much better setup on aarch64 to track such things down. Thanks again for the reducers!
[Bug middle-end/113734] [14 regression] libarchive miscompiled (fails libarchive_test_read_format_rar5_extra_field_version test) since r14-8768-g85094e2aa6dba7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113734 --- Comment #15 from Tamar Christina --- (In reply to Sam James from comment #14) > Created attachment 57390 [details] > test.c > > I'll try reducing it preprocessed now (couldn't do it before as checking w/ > clang as well in the reduction script to avoid introducing UB). Thanks! this is good enough. Thanks for the reducer!
[Bug tree-optimization/113808] [14 Regression] FAIL: libgomp.fortran/non-rectangular-loop-1.f90 since r14-8768
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113808 Tamar Christina changed: What|Removed |Added Resolution|--- |FIXED Status|NEW |RESOLVED --- Comment #11 from Tamar Christina --- Fixed thanks.
[Bug tree-optimization/113808] [14 Regression] FAIL: libgomp.fortran/non-rectangular-loop-1.f90 since r14-8768
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113808 Tamar Christina changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |tnfchris at gcc dot gnu.org --- Comment #9 from Tamar Christina --- (In reply to Richard Biener from comment #6)
> With the following I don't see things going wrong, but we end up with the
> loop having the STOP exit last instead and thus a PEELED case.
If it's not a PEELED case then the code is wrong indeed.

_100 = BIT_FIELD_REF ;
k.4_43 = _100;

is wrong since for a normal case the primary exit needs to do a last reduction rather than a first.

_109 = BIT_FIELD_REF ;
_48 = _109;
_100 = BIT_FIELD_REF ;
k.4_43 = _100;

These two reduction orders should never be different. The bug seems to be in vectorizable_live_operation where we determine if the index needs to be a first or last reduction. There's a boolean there

restart_loop = restart_loop || !main_exit_edge;

and we initially set it to

bool restart_loop = LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo);

outside the USE/DEF loop. The problem is this depends on seeing the uses for the LOOP_VINFO_IV_EXIT before seeing that of the early exits. The code goes wrong because we see the early exit first and then see the main exit, but once true the boolean can't become false again. It's a silly bug; the boolean shouldn't be cached between loop iters. Quick hack:

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 190df9ec774..109a7e16abb 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -10966,7 +10966,7 @@ vectorizable_live_operation (vec_info *vinfo, stmt_vec_info stmt_info,
	  /* For early exit where the exit is not in the BB that leads
	     to the latch then we're restarting the iteration in the
	     scalar loop.  So get the first live value.  */
-	  restart_loop = restart_loop || !main_exit_edge;
+	  restart_loop = !main_exit_edge;
	  if (restart_loop
	      && STMT_VINFO_DEF_TYPE (stmt_info) == vect_induction_def)
	    {

works, but I will revisit this and fix it properly now. Thanks for the reduction.
[Bug tree-optimization/113808] [14 Regression] FAIL: libgomp.fortran/non-rectangular-loop-1.f90
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113808 --- Comment #2 from Tamar Christina --- I guess whether that code is correct depends on which exit was picked though. I'll look at the dump too.
[Bug tree-optimization/113750] [14 Regression] ICE in vect building gcc/m2/gm2-libs/NumberIO.mod since r14-8769-g64b0130bb6702c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113750 Tamar Christina changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #8 from Tamar Christina --- Fixed.