[Bug c/115104] RISC-V: GCC-14 can combine vsext+vadd -> vwadd but Trunk GCC (GCC 15) Failed
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115104 --- Comment #2 from Robin Dapp --- Thanks, I was just about to open a PR.
[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583 --- Comment #18 from Robin Dapp --- A bit of a follow-up: I'm working on a patch for reassociation that can handle the mentioned cases and some more, but it will still require a bit of time to get everything regression-free and correct. What it does is allow reassoc to look through constant multiplications and negates in order to provide more freedom in the optimization process. Regarding the mentioned element-wise costing: how should we proceed here? I'm going to remove the hunk in question, run SPEC 2017 on x86, and post a patch in order to get some data and a basis for discussion.
[Bug middle-end/114196] [13 Regression] Fixed length vector ICE: in vect_peel_nonlinear_iv_init, at tree-vect-loop.cc:9454
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114196 --- Comment #7 from Robin Dapp --- I can barely build a compiler on gcc185 due to disk space. I'm going to set up a cross toolchain (that I need for other purposes as well) in order to test.
[Bug target/114734] [14] RISC-V rv64gcv_zvl256b miscompile with -flto -O3 -mrvv-vector-bits=zvl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114734 --- Comment #10 from Robin Dapp --- Yes, it helps. Funny that get_gimple_for_ssa_name is right below get_rtx_for_ssa_name, which I stepped through several times while debugging without realizing the connection. But thanks! Good thing it can be solved like that. I cannot do a bootstrap/regtest for aarch64 because cfarm185 is almost out of disk space. As the bug is old and very unlikely to trigger it can surely wait for GCC 15?
[Bug target/114734] [14] RISC-V rv64gcv_zvl256b miscompile with -flto -O3 -mrvv-vector-bits=zvl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114734 --- Comment #8 from Robin Dapp --- Created attachment 58037 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58037&action=edit Expand dump Dump attached. Insn 209 is the problematic one. The change from _911 to _1078 happens in internal-fn.cc:expand_call_mem_ref (and not via TER). The lookup there is simple and I was also wondering whether some single_imm_use check or similar is missing.
[Bug target/114734] [14] RISC-V rv64gcv_zvl256b miscompile with -flto -O3 -mrvv-vector-bits=zvl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114734 Robin Dapp changed: What|Removed |Added CC||rguenth at gcc dot gnu.org, ||rsandifo at gcc dot gnu.org --- Comment #6 from Robin Dapp --- This one is really a bit tricky. We have the following situation:

loop:
  # vectp_g.178_1078 = PHI <...>
  _911 = vectp_g.178_1078
  MASK_LEN_LOAD (_911, ...);
  vectp_g.178_1079 = vectp_g.178_1078 + 16;
  goto loop;

  MASK_LEN_LOAD (_911, ...);

During expand we basically convert the _911 back to vectp_g.178_1078 (reverting what we did in ivopts before). Because _911 camouflages vectp_g.178_1078 until expand, we evade the conflict checks of out-of-ssa that would catch a similar, non-camouflaged situation like:

  # vectp_g.178_1078 = PHI <...>
  MASK_LEN_LOAD (MEM... vectp_g.178_1078, ...);
  vectp_g.178_1079 = vectp_g.178_1078 + 16;
  goto loop;

  MASK_LEN_LOAD (MEM... vectp_g.178_1078, ...);

and would insert a copy of the definition right before the backedge. The MASK_LEN_LOAD after the loop would then use that copy. By using _911 instead of the original pointer no conflict is detected and we wrongly use the incremented pointer. Without the ivopts change for TARGET_MEM_REF the problem does not occur. Unless I'm misunderstanding some basic mechanism it's not going to work like that (and we could also have this situation on aarch64). What could help is to enhance trivially_conflicts_p in out-of-ssa to catch such TARGET_MEM_REFs and handle them similarly to a normal conflict. I did that locally and it helps for this particular case, but I'd rather not post it in its current hacky state even if the riscv testsuite looks ok :) Even if that were the correct solution I'd doubt it should land in stage 4. CC'ing Richard Sandiford as he originally introduced the ivopts and expand handling.
[Bug target/114734] [14] RISC-V rv64gcv_zvl256b miscompile with -flto -O3 -mrvv-vector-bits=zvl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114734 --- Comment #5 from Robin Dapp --- What happens is that code sinking does:

Sinking
  # VUSE <.MEM_1235>
  vect__173.251_1238 = .MASK_LEN_LOAD (_911, 32B, { -1, -1, -1, -1 }, loop_len_1064, 0);
from bb 3 to bb 4

so we have

  vect__173.251_1238 = .MASK_LEN_LOAD (_911, 32B, { -1, -1, -1, -1 }, loop_len_1064, 0);

after the loop. When expanding this stmt, expand_call_mem_ref creates a mem reference to vectp_g.178 for _911 (== vectp_g.178_1078). This is expanded to the same rtl as vectp_g.178_1079 (which is incremented before the latch, as opposed to ..._1078, which is not). Disabling sinking or expand_call_mem_ref both help but neither is correct of course :) I don't have a solution yet but I'd hope we're a bit closer to the problem now.
[Bug target/114714] [RISC-V][RVV] ICE: insn does not satisfy its constraints (postreload)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114714 Robin Dapp changed: What|Removed |Added CC||rdapp at gcc dot gnu.org --- Comment #5 from Robin Dapp --- Did anybody do some further investigation here? Juzhe messaged me that this PR is the original reason for the reversal but I don't yet understand why the register filters don't encompass the full semantics of RVV overlap. I looked into the test case and what happens is that, in order to determine the validity of the alternatives, riscv_get_v_regno_alignment is first called with an M2 mode. Our destination is actually a (subreg:RVVM2SI (reg:RVVM4SI ...) 0), though. I suppose lra/reload checks whether a non-subreg destination also works and hands us a (reg:RVVM4SI ...) as operand[0]. We pass this to riscv_get_v_regno_alignment which, for an LMUL4 mode, returns 4, thus wrongly enabling the W42 alternatives. A W42 alternative permits hard regs % 4 == 2, which causes us to eventually choose vr2 as destination and source. Once the constraints are actually checked we have a mismatch as none of the alternatives works. Now I'm not at all sure how lra/reload uses operand[0] here but this can surely be found out. A quick and dirty hack (attached) that checks the insn's destination mode instead of operand[0]'s mode gets rid of the ICE and doesn't cause regressions. I suppose we're too far along with the reversal already but I'd really have preferred more details. Maybe somebody has had an in-depth look but it just wasn't posted yet?
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -6034,6 +6034,22 @@ riscv_get_v_regno_alignment (machine_mode mode)
   return lmul;
 }

+int
+riscv_get_dest_alignment (rtx_insn *insn, rtx operand)
+{
+  const_rtx set = 0;
+  if (GET_CODE (PATTERN (insn)) == SET)
+    {
+      set = PATTERN (insn);
+      rtx op = SET_DEST (set);
+      return riscv_get_v_regno_alignment (GET_MODE (op));
+    }
+  else
+    {
+      return riscv_get_v_regno_alignment (GET_MODE (operand));
+    }
+}
+
 /* Define ASM_OUTPUT_OPCODE to do anything special before emitting an opcode. */
 const char *
diff --git a/gcc/config/riscv/riscv.md b/gcc/config/riscv/riscv.md
index ce1ee6b9c5e..5113daf2ac7 100644
--- a/gcc/config/riscv/riscv.md
+++ b/gcc/config/riscv/riscv.md
@@ -550,15 +550,15 @@ (define_attr "group_overlap_valid" "no,yes"
          (const_string "yes")

       (and (eq_attr "group_overlap" "W21")
-           (match_test "riscv_get_v_regno_alignment (GET_MODE (operands[0])) != 2"))
+           (match_test "riscv_get_dest_alignment (insn, operands[0]) != 2"))
        (const_string "no")

       (and (eq_attr "group_overlap" "W42")
-           (match_test "riscv_get_v_regno_alignment (GET_MODE (operands[0])) != 4"))
+           (match_test "riscv_get_dest_alignment (insn, operands[0]) != 4"))
        (const_string "no")

       (and (eq_attr "group_overlap" "W84")
-           (match_test "riscv_get_v_regno_alignment (GET_MODE (operands[0])) != 8"))
+           (match_test "riscv_get_dest_alignment (insn, operands[0]) != 8"))
        (const_string "no")
[Bug target/114734] [14] RISC-V rv64gcv_zvl256b miscompile with -flto -O3 -mrvv-vector-bits=zvl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114734 --- Comment #4 from Robin Dapp --- Ok, it looks like we do 5 iterations with the last one being length-masked to length 2 and then in the "live extraction" phase use "iteration 6".
[Bug target/114734] [14] RISC-V rv64gcv_zvl256b miscompile with -flto -O3 -mrvv-vector-bits=zvl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114734 --- Comment #3 from Robin Dapp ---
> probably -fwhole-program is enough, -flto not needed(?)

Yes, -fwhole-program is sufficient.

>  # vectp_g.248_1401 = PHI <...>
>  ...
>  _1411 = .SELECT_VL (ivtmp_1409, POLY_INT_CST [2, 2]);
>  ...
>  vect__193.250_1403 = .MASK_LEN_LOAD (vectp_g.248_1401, 32B, { -1, ... }, _1411, 0);
>  vect__194.251_1404 = -vect__193.250_1403;
>  vect_iftmp.252_1405 = (vector([2,2]) long int) vect__194.251_1404;
>
>  # vect_iftmp.252_1406 = PHI <...>
>  # loop_len_1427 = PHI <_1411(5)>
>  ...
>  _1407 = loop_len_1427 + 18446744073709551615;
>  _1408 = .VEC_EXTRACT (vect_iftmp.252_1406, _1407);
>  iftmp.3_1204 = _1408;
>
> is stored to b[15].

Doesn't look too odd to me. At the assembly equivalent of

  vect__193.250_1403 = .MASK_LEN_LOAD (vectp_g.248_1401, 32B, { -1, ... }, _1411, 0);

we load [3 3] (= f) instead of [0 0] (= g). f is located after g in memory and register a3 is increased before the loop latch. We then re-use a3 to load the last two elements of g but actually read the first two of f.
[Bug target/114734] [14] RISC-V rv64gcv_zvl256b miscompile with -flto -O3 -mrvv-vector-bits=zvl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114734 --- Comment #1 from Robin Dapp --- Confirmed.
[Bug middle-end/114733] [14] Miscompile with -march=rv64gcv -O3 on riscv
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114733 --- Comment #1 from Robin Dapp --- Confirmed, also shows up here.
[Bug target/114665] [14] RISC-V rv64gcv: miscompile at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114665 --- Comment #5 from Robin Dapp --- Weird, I tried your exact qemu version and still can't reproduce the problem. My results are always FFB5. Binutils difference? Very unlikely. Could you post your QEMU_CPU settings just to be sure?
[Bug target/114668] [14] RISC-V rv64gcv: miscompile at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114668 Robin Dapp changed: What|Removed |Added Resolution|--- |FIXED Status|UNCONFIRMED |RESOLVED --- Comment #4 from Robin Dapp --- I didn't have the time to fully investigate but the default path without vec extract is definitely broken for masks. I'd probably sleep better if we fixed that at some point but for now the obvious fix is to add the missing expanders. Patrick, I'm still unable to reproduce PR114665 (maybe also a qemu difference?). Could you re-check with this fix? Thanks.
[Bug target/114686] Feature request: Dynamic LMUL should be the default for the RISC-V Vector extension
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114686 --- Comment #3 from Robin Dapp --- I think we have always maintained that this can definitely be a per-uarch default but shouldn't be a generic default.

> I don't see any reason why this wouldn't be the case for the vast majority
> of implementations, especially high performance ones would benefit from
> having more work to saturate the execution units with, since a larger LMUL
> works quite similar to loop unrolling.

One argument is reduced freedom for renaming and the out-of-order machinery. It's much easier to shuffle individual registers around than large blocks. Also, lower-latency insns are easier to schedule than longer-latency ones, and faults, rejects, aborts etc. get proportionally more expensive. I was under the impression that unrolling doesn't help a whole lot (sometimes it even slows things down a bit) on modern cores and certainly is not unconditionally helpful. Granted, I haven't seen a lot of data on it recently. An exception is of course breaking dependency chains. In general nothing stands in the way of having a particular tune target use dynamic LMUL by default even now, but nobody has gone ahead and posted a patch for theirs. One could maybe argue that it should be the default for in-order uarchs? Should it become obvious in the future that LMUL > 1 is indeed, unconditionally, a "better unrolling" because of its favorable icache footprint and other properties (which I doubt - happy to be proved wrong) then we will surely re-evaluate the decision or rather reach a different consensus. The data we publicly have so far is all from in-order cores and my expectation is that the picture will change once out-of-order cores hit the scene.
[Bug target/114668] [14] RISC-V rv64gcv: miscompile at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114668 --- Comment #2 from Robin Dapp --- This, again, seems to be a problem with bit extraction from masks. For some reason I didn't add the VLS modes to the corresponding vec_extract patterns. With those in place the problem is gone because we go through the expander which does the right thing. I'm still checking what exactly goes wrong without those as there is likely a latent bug.
[Bug target/114665] [14] RISC-V rv64gcv: miscompile at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114665 --- Comment #2 from Robin Dapp --- Checked with the latest commit on a different machine but still cannot reproduce the error. PR114668 I can reproduce. Maybe a copy and paste problem?
[Bug target/114665] [14] RISC-V rv64gcv: miscompile at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114665 --- Comment #1 from Robin Dapp --- Hmm, my local version is a bit older and seems to give the same result for both -O2 and -O3. At least a good starting point for bisection then.
[Bug ipa/114247] RISC-V: miscompile at -O3 and IPA SRA
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114247 --- Comment #6 from Robin Dapp --- Testsuite looks unchanged on rv64gcv.
[Bug ipa/114247] RISC-V: miscompile at -O3 and IPA SRA
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114247 --- Comment #5 from Robin Dapp --- This fixes the test case for me locally, thanks. I can run the testsuite with it later if you'd like.
[Bug tree-optimization/114476] [13/14 Regression] wrong code with -fwrapv -O3 -fno-vect-cost-model (and -march=armv9-a+sve2 on aarch64 and -march=rv64gcv on riscv)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114476 --- Comment #8 from Robin Dapp --- I tried a few approaches (for the related bug without -fwrapv) but then got busy elsewhere. I'm going to have another look later this week.
[Bug rtl-optimization/114515] [14 Regression] Failure to use aarch64 lane forms after PR101523
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114515 Robin Dapp changed: What|Removed |Added CC||ewlu at rivosinc dot com, ||rdapp at gcc dot gnu.org --- Comment #7 from Robin Dapp --- There is some riscv fallout as well. Edwin has the details.
[Bug tree-optimization/114485] [13/14 Regression] Wrong code with -O3 -march=rv64gcv on riscv or `-O3 -march=armv9-a` for aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114485 --- Comment #4 from Robin Dapp --- Yes, the vectorization looks ok. The extracted live values are not used afterwards and therefore the whole vectorized loop is being thrown away. Then we do one iteration of the epilogue loop, inverting the original c, and end up with -8 instead of 8. This is pretty similar to what's happening in the related PR. We properly populate the phi in question in slpeel_update_phi_nodes_for_guard1:

  c_lsm.7_64 = PHI <_56(23), pretmp_34(17)>

but vect_update_ivs_after_vectorizer changes that into

  c_lsm.7_64 = PHI <...>.

Just as a test, commenting out

  if (!LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo))
    vect_update_ivs_after_vectorizer (loop_vinfo, niters_vector_mult_vf, update_e);

at least makes us keep the VEC_EXTRACT and not fail anymore.
[Bug tree-optimization/114476] [13/14 Regression] wrong code with -fwrapv -O3 -fno-vect-cost-model (and -march=armv9-a+sve2 on aarch64 and -march=rv64gcv on riscv)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114476 --- Comment #5 from Robin Dapp --- So the result is -9 instead of 9 (or vice versa) and this happens (just) with vectorization. We only vectorize with -fwrapv.

From a first quick look, the following is what we have before vect:

(loop) [local count: 991171080]:
  ...
  # b_lsm.5_5 = PHI <_4(7), b_lsm.5_17(2)>
  ...
  _4 = -b_lsm.5_5;

(check) [local count: 82570744]:
  ...
  # b_lsm.5_22 = PHI <...>
  ...
  if (b_lsm.5_22 != -9)

I.e. b gets negated with every iteration and we check the second to last against -9. With vectorization we have:

(init) [local count: 82570744]:
  b_lsm.5_17 = b;

(vectorized loop) [local count: 247712231]:
  ...
  # b_lsm.5_5 = PHI <_4(7), b_lsm.5_17(2)>
  ...
  _4 = -b_lsm.5_5;
  ...
  goto ...

(epilogue) [local count: 82570741]:
  ...
  # b_lsm.5_7 = PHI <_25(11), b_lsm.5_17(13)>
  ...
  _25 = -b_lsm.5_7;

(check) [local count: 82570744]:
  ...
  # b_lsm.5_22 = PHI <...>
  if (b_lsm.5_22 != -9)

What looks odd here is that b_lsm.5_7's fallthrough argument is b_lsm.5_17 even though we must have come through the vectorized loop (which negated b at least once). This makes us skip inversions. Indeed, as b_lsm.5_22 only depends on the initial value of b, it gets optimized away and we compare b != -9. Maybe I missed something but it looks like

  # b_lsm.5_7 = PHI <_25(11), b_lsm.5_17(13)>

should have b_lsm.5_5 or _4 as fallthrough argument.
[Bug tree-optimization/114396] [14 Regression] Vector: Runtime mismatch at -O2 with -fwrapv
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396 --- Comment #8 from Robin Dapp --- No fallout on x86 or aarch64. Of course using false instead of TYPE_SIGN (utype) is also possible and maybe clearer?
[Bug tree-optimization/114396] [14 Regression] Vector: Runtime mismatch at -O2 with -fwrapv
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396 --- Comment #7 from Robin Dapp ---

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 4375ebdcb49..f8f7ba0ccc1 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -9454,7 +9454,7 @@ vect_peel_nonlinear_iv_init (gimple_seq* stmts, tree init_expr,
       wi::to_mpz (skipn, exp, UNSIGNED);
       mpz_ui_pow_ui (mod, 2, TYPE_PRECISION (type));
       mpz_powm (res, base, exp, mod);
-      begin = wi::from_mpz (type, res, TYPE_SIGN (type));
+      begin = wi::from_mpz (type, res, TYPE_SIGN (utype));
       tree mult_expr = wide_int_to_tree (utype, begin);
       init_expr = gimple_build (stmts, MULT_EXPR, utype,
                                 init_expr, mult_expr);

This helps for the test case.
[Bug target/114396] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3 with -fwrapv
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396 --- Comment #3 from Robin Dapp --- -O3 -mavx2 -fno-vect-cost-model -fwrapv seems to be sufficient.
[Bug target/114396] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3 with -fwrapv
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396 Robin Dapp changed: What|Removed |Added Target|riscv*-*-* |x86_64-*-* riscv*-*-* --- Comment #2 from Robin Dapp --- At first glance it doesn't really look like a target issue. Tried it on x86 and it fails as well with

  -O3 -march=native pr114396.c -fno-vect-cost-model -fwrapv

short a = 0xF;
short b[16];

int main ()
{
  for (int e = 0; e < 9; e += 1)
    b[e] = a *= 0x5;
  if (a != 2283)
    __builtin_abort ();
}
[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548 --- Comment #29 from Robin Dapp --- Yes, that also appears to work here. There was no lto involved this time? Now we need to figure out what's different with SPEC.
[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548 --- Comment #27 from Robin Dapp --- Can you try it with a simpler (non SPEC) test? Maybe there is still something weird happening with SPEC's scripting.
[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548 --- Comment #24 from Robin Dapp --- I rebuilt GCC from scratch with your options but still have the same problem. Could our sources differ? My SPEC version might not be the most recent but I'm not aware that mcf changed at some point. Just to be sure: I'm using r14-5075-gc05f748218a0d5 as the "before" commit.
[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548 --- Comment #22 from Robin Dapp --- Still the same problem unfortunately. I'm a bit out of ideas - maybe your compiler executables could help?
[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548 --- Comment #20 from Robin Dapp --- No change with -std=gnu99 unfortunately.
[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548 --- Comment #18 from Robin Dapp --- Hmm, doesn't help unfortunately. A full command line for me looks like:

  x86_64-pc-linux-gnu-gcc -c -o pbeampp.o -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO -Ofast -march=znver4 -mtune=znver4 -flto=32 -g -fprofile-use=/tmp -DSPEC_CPU_LP64 pbeampp.c

Could you verify if it's exactly the same for you? Maybe it would also help if you explicitly specified znver4?
[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548 --- Comment #16 from Robin Dapp --- Thank you! I'm having a problem with the data, though. Compiling with -Ofast -march=znver4 -mtune=znver4 -flto -fprofile-use=/tmp. Would you mind showing your exact final options for the compilation of e.g. pbeampp.c? I see, similar-ish for both commits:

pbeampp.c:119:8: error: number of counters in profile data for function 'primal_bea_mpp' does not match its profile data (counter 'arcs', expected 20 and have 22) [-Werror=coverage-mismatch]
output.c:87:1: error: corrupted profile info: number of executions for edge 3-4 thought to be 1
output.c:87:1: error: corrupted profile info: number of executions for edge 3-5 thought to be -1
output.c:87:1: error: corrupted profile info: number of iterations for basic block 5 thought to be -1
[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548 --- Comment #10 from Robin Dapp --- (In reply to Sam James from comment #9)
> (In reply to Filip Kastl from comment #8)
> > I'd like to help but I'm afraid I cannot send you the SPEC binaries with
> > PGO applied since SPEC is licensed nor can I give you access to a Zen4
> > computer. I suppose someone else will have to analyze this bug.
>
> Could you perhaps send only the gcda files so Robin can build again with
> -fprofile-use?

Yes, that would be helpful. Or Filip builds the executables himself and posts (some of) the differences here. Maybe that also gets us a bit closer to the problem.
[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548 --- Comment #7 from Robin Dapp --- I built executables with and without the commit (-Ofast -march=znver4 -flto). There is no difference so it must really be something that happens with PGO. I'd really need access to a zen4 box or the pgo executables at least.
[Bug target/114202] [14] RISC-V rv64gcv: miscompile at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114202 Robin Dapp changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |DUPLICATE --- Comment #3 from Robin Dapp --- Same as PR114200. *** This bug has been marked as a duplicate of bug 114200 ***
[Bug target/114200] [14] RISC-V fixed-length vector miscompile at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114200 --- Comment #3 from Robin Dapp --- *** Bug 114202 has been marked as a duplicate of this bug. ***
[Bug middle-end/114196] [13/14 Regression] Fixed length vector ICE: in vect_peel_nonlinear_iv_init, at tree-vect-loop.cc:9454
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114196 Robin Dapp changed: What|Removed |Added See Also||https://gcc.gnu.org/bugzill ||a/show_bug.cgi?id=113163 --- Comment #2 from Robin Dapp --- To me this looks like it already came up in the context of early-break vectorization (PR113163) but is not actually dependent on it. I'm testing a patch that disables epilogue peeling also without early break.
[Bug target/114200] [14] RISC-V fixed-length vector miscompile at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114200 --- Comment #1 from Robin Dapp --- Took me a while to analyze this... needed more time than I'd like to admit to make sense of the somewhat weird code created by fully unrolling and peeling. I believe the problem is that we reload the output register of a vfmacc/fma via vmv.v.v (which is subject to length masking) when we should be using vmv1r.v (a whole-register move). The result is used by a reduction which always operates on the full vector length. As annoying as it was to find - it's definitely a good catch. I'm testing a patch. PR114202 is indeed a duplicate. Going to add its test case to the patch.
[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548 --- Comment #6 from Robin Dapp --- Honestly, I don't know how to analyze/debug this without a zen4, in particular as it only seems to happen with PGO. I tried locally but of course the execution time doesn't change (same as with zen3 according to the database). Is there a way to obtain the binaries in order to tell a difference?
[Bug middle-end/114109] x264 satd vectorization vs LLVM
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114109 --- Comment #4 from Robin Dapp --- Yes, as mentioned, vectorization of the first loop is debatable.
[Bug middle-end/114109] x264 satd vectorization vs LLVM
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114109 --- Comment #2 from Robin Dapp --- It is vectorized with a higher zvl, e.g. zvl512b, refer https://godbolt.org/z/vbfjYn5Kd.
[Bug middle-end/114109] New: x264 satd vectorization vs LLVM
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114109 Bug ID: 114109 Summary: x264 satd vectorization vs LLVM Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: enhancement Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: rdapp at gcc dot gnu.org CC: juzhe.zhong at rivai dot ai, law at gcc dot gnu.org Target Milestone: --- Target: x86_64-*-* riscv*-*-*

Looking at the following code of x264 (SPEC 2017):

typedef unsigned char uint8_t;
typedef unsigned short uint16_t;
typedef unsigned int uint32_t;

static inline uint32_t abs2 (uint32_t a)
{
  uint32_t s = ((a >> 15) & 0x10001) * 0xffff;
  return (a + s) ^ s;
}

int x264_pixel_satd_8x4 (uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2)
{
  uint32_t tmp[4][4];
  uint32_t a0, a1, a2, a3;
  int sum = 0;
  for (int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2)
    {
      a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
      a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
      a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
      a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
      {
        int t0 = a0 + a1;
        int t1 = a0 - a1;
        int t2 = a2 + a3;
        int t3 = a2 - a3;
        tmp[i][0] = t0 + t2;
        tmp[i][1] = t1 + t3;
        tmp[i][2] = t0 - t2;
        tmp[i][3] = t1 - t3;
      };
    }
  for (int i = 0; i < 4; i++)
    {
      {
        int t0 = tmp[0][i] + tmp[1][i];
        int t1 = tmp[0][i] - tmp[1][i];
        int t2 = tmp[2][i] + tmp[3][i];
        int t3 = tmp[2][i] - tmp[3][i];
        a0 = t0 + t2;
        a2 = t0 - t2;
        a1 = t1 + t3;
        a3 = t1 - t3;
      };
      sum += abs2 (a0) + abs2 (a1) + abs2 (a2) + abs2 (a3);
    }
  return (((uint16_t) sum) + ((uint32_t) sum >> 16)) >> 1;
}

I first checked on riscv but x86 and aarch64 are pretty similar. (Refer https://godbolt.org/z/vzf5ha44r that compares at -O3 -mavx512f.) Vectorizing the first loop seems to be a costing issue. By default we don't vectorize and the code becomes much larger when disabling vector costing, so the costing decision in itself seems correct.
Clang's version is significantly shorter and it looks like it just directly vec_sets/vec_inits the individual elements. On riscv this can be handled rather elegantly with strided loads, which we don't emit right now. As there are only 4 active vector elements and the loop is likely load bound, it might be debatable whether LLVM's version is better. The second loop we do vectorize (4 elements at a time) but end up with e.g. four XORs for the four inlined abs2 calls, while clang chooses a larger vectorization factor and does all the xors in one. On my laptop (no avx512) I don't see a huge difference (113s GCC vs 108s LLVM) but I guess the general case is still interesting?
[Bug target/114028] [14] RISC-V rv64gcv_zvl256b: miscompile at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114028 --- Comment #2 from Robin Dapp --- This is a target issue. It looks like we try to construct a "superword" sequence when the element size is already == Pmode. Testing a patch.
[Bug target/114027] [14] RISC-V vector: miscompile at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114027 --- Comment #9 from Robin Dapp --- Argh, I actually just did a

  gcc -O3 -march=native pr114027.c -fno-vect-cost-model

on cfarm188 with a recent-ish GCC but realized that I used my slightly modified version and not the original test case:

long a;
int b[10][8] = {{}, {}, {}, {}, {}, {},
                {0, 0, 0, 0, 0, 1, 1},
                {1, 1, 1, 1, 1, 1, 1},
                {1, 1, 1, 1, 1, 1, 1}};
int c;

int main ()
{
  int d;
  c = 0x;
  for (; a < 6; a++)
    {
      d = 0;
      for (; d < 6; d++)
        {
          c ^= -3L;
          if (b[a + 3][d])
            continue;
          c = 0;
        }
    }
  if (c == -3)
    return 0;
  else
    return 1;
}

This was from an initial attempt to minimize it further but I didn't really verify whether I'm breaking the test case by that (or causing undefined behavior). With that I get a "1" with default options and "0" with -fno-tree-vectorize. Maybe my snippet is broken then?
[Bug target/114027] [14] RISC-V vector: miscompile at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114027 Robin Dapp changed: What|Removed |Added CC||rguenth at gcc dot gnu.org Last reconfirmed||2024-02-22 Target|riscv |x86_64-*-* riscv*-*-* aarch64-*-* --- Comment #5 from Robin Dapp --- To me it looks like we interpret e.g.

  c_53 = _43 ? prephitmp_13 : 0

as the only reduction statement and simplify to MAX because of the wrong assumption that this is the only reduction statement in the chain, when we actually have several. (See "condition expression based on compile time constant".)

--- Comment #6 from Robin Dapp --- Btw this fails on x86 and aarch64 for me with -fno-vect-cost-model. So it definitely looks generic.
[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548 --- Comment #4 from Robin Dapp --- Judging by the graph it looks like it was slow before, then got faster and now slower again. Is there some more info on why it got faster in the first place? Did the patch reverse something or is it rather a secondary effect? I don't have a zen4 handy to check.
[Bug target/113827] MrBayes benchmark redundant load on riscv
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113827 --- Comment #1 from Robin Dapp --- x86 (-march=native -O3 on an i7 12th gen) looks pretty similar:

.L3:
        movq    (%rdi), %rax
        vmovups (%rax), %xmm1
        vdivps  %xmm0, %xmm1, %xmm1
        vmovups %xmm1, (%rax)
        addq    $16, %rax
        movq    %rax, (%rdi)
        addq    $8, %rdi
        cmpq    %rdi, %rdx
        jne     .L3

So probably not target specific. Costing?
[Bug target/113827] New: MrBayes benchmark redundant load
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113827 Bug ID: 113827 Summary: MrBayes benchmark redundant load Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: enhancement Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: rdapp at gcc dot gnu.org CC: juzhe.zhong at rivai dot ai, law at gcc dot gnu.org, pan2.li at intel dot com Blocks: 79704 Target Milestone: --- Target: riscv

A hot block in the MrBayes benchmark (as used in the Phoronix testsuite) has a redundant scalar load when vectorized. Minimal example, compiled with -march=rv64gcv -O3:

int foo (float **a, float f, int n)
{
  for (int i = 0; i < n; i++)
    {
      a[i][0] /= f;
      a[i][1] /= f;
      a[i][2] /= f;
      a[i][3] /= f;
      a[i] += 4;
    }
}

GCC:
.L3:
        ld      a5,0(a0)
        vle32.v v1,0(a5)
        vfmul.vv        v1,v1,v2
        vse32.v v1,0(a5)
        addi    a5,a5,16
        sd      a5,0(a0)
        addi    a0,a0,8
        bne     a0,a4,.L3

The value of a5 doesn't change after the store to 0(a0).

LLVM:
.L3:
        vle32.v v8,(a1)
        addi    a3,a1,16
        sd      a3,0(a2)
        vfdiv.vf        v8,v8,fa5
        addi    a2,a2,8
        vse32.v v8,(a1)
        bne     a2,a0,.L3

Referenced Bugs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79704 [Bug 79704] [meta-bug] Phoronix Test Suite compiler performance issues
[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607 --- Comment #23 from Robin Dapp ---
> this is:
>
> _429 = mask_patt_205.47_276[i] ? vect_cst__262[i] : (vect_cst__262 <<
> {0,..})[i];
> vect_iftmp.55_287 = mask_patt_209.54_286[i] ? _429 [i] : vect_cst__262[i]

But isn't it rather _429 = mask_patt_205.47_276[i] ? (vect_cst__262[i] << vect_cst__262[i]) : {0,..}[i]? The else should be the last operand, shouldn't it?

On aarch64 we don't seem to emit a COND_SHL, therefore this particular situation does not occur. However, the simplification was introduced for aarch64:

(for cond_op (COND_BINARY)
 (simplify
  (vec_cond @0 (cond_op:s @1 @2 @3 @4) @3)
  (cond_op (bit_and @1 @0) @2 @3 @4)))

It is supposed to simplify (in gcc.target/aarch64/sve/pre_cond_share_1.c)

_256 = .COND_MUL (mask__108.48_193, vect_iftmp.45_187, vect_cst__190, { 0.0, ... });
vect_prephitmp_151.50_197 = VEC_COND_EXPR ;

into

COND_MUL (mask108 & mask101, vect_iftmp.45_187, vect_cst__190, { 0.0, ... });

But that doesn't look valid to me either. No matter what _256 is, the result for !mask101 should be vect_cst__190 and not 0.0.
[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607 --- Comment #19 from Robin Dapp --- What seems odd to me is that in fre5 we simplify

_429 = .COND_SHL (mask_patt_205.47_276, vect_cst__262, vect_cst__262, { 0, ... });
vect_prephitmp_129.51_282 = _429;
vect_iftmp.55_287 = VEC_COND_EXPR ;

to

Applying pattern match.pd:9607, gimple-match-10.cc:3817
gimple_simplified to vect_iftmp.55_287 = .COND_SHL (mask_patt_205.47_276, vect_cst__262, vect_cst__262, { 0, ... });

so we fold vec_cond (mask209, prephitmp129, vect_cst262) with prephitmp129 = cond_shl (mask205, vect_cst262, vect_cst262, 0) into cond_shl (mask205, vect_cst262, vect_cst262, 0)? That doesn't look valid to me because the vec_cond's else value (vect_cst262) gets lost. Wouldn't such a simplification need a conditional else value, like !mask1 ? else1 : else2, instead of else2 unconditionally?
[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607 --- Comment #18 from Robin Dapp --- Hehe no it doesn't make sense... I wrongly read a v2 as a v1. Please disregard the last message.
[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607 --- Comment #17 from Robin Dapp --- Grasping for straws by blaming qemu ;) At some point we do the vector shift vsll.vv v1,v2,v2,v0.t but the mask v0 is all zeros (gdb: b = {0 }). According to the mask-undisturbed policy set before (vsetvli zero,zero,e32,mf2,ta,mu) all elements should be unchanged. I'm seeing an all-zeros result in v1, though. v1 is used as 'j', is zero and therefore 'q' is not incremented and we don't assign c = d, causing the wrong result.

Before the shift I see v2 in gdb as: w = {4294967295, 4294967295, 0, 0}. (That's also a bit dubious because we load 2 elements from 'g' of which only one should be -1. This doesn't change the end result, though.)

After the shift gdb shows v1 as w = {0, 0, 0, 0}, when it should be w = {-1, -1, 0, 0}. Does this make sense?
[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607 --- Comment #16 from Robin Dapp --- Disabling vec_extract makes us operate on non-partial vectors, though, so there are a lot of differences in codegen. I'm going to have a look.
[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583 --- Comment #9 from Robin Dapp --- (In reply to rguent...@suse.de from comment #6)
> t.c:47:21: missed: the size of the group of accesses is not a power of 2
> or not equal to 3
> t.c:47:21: missed: not falling back to elementwise accesses
> t.c:58:15: missed: not vectorized: relevant stmt not supported: _4 = *_3;
> t.c:47:21: missed: bad operation or unsupported loop bound.
>
> where we don't consider using gather because we have a known constant
> stride (20). Since the stores are really scatters we don't attempt
> to SLP either.
>
> Disabling the above heuristic we get this vectorized as well, avoiding
> gather/scatter by manually implementing them and using a quite high
> VF of 8 (with -mprefer-vector-width=256 you get VF 4 and likely
> faster code in the end).

I suppose you're referring to this?

  /* FIXME: At the moment the cost model seems to underestimate the cost
     of using elementwise accesses.  This check preserves the traditional
     behavior until that can be fixed.  */
  stmt_vec_info first_stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info);
  if (!first_stmt_info)
    first_stmt_info = stmt_info;
  if (*memory_access_type == VMAT_ELEMENTWISE
      && !STMT_VINFO_STRIDED_P (first_stmt_info)
      && !(stmt_info == DR_GROUP_FIRST_ELEMENT (stmt_info)
           && !DR_GROUP_NEXT_ELEMENT (stmt_info)
           && !pow2p_hwi (DR_GROUP_SIZE (stmt_info))))
    {
      if (dump_enabled_p ())
        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                         "not falling back to elementwise accesses\n");
      return false;
    }

I did some more tests on my laptop. As said above the whole loop in lbm is larger and contains two ifs. The first one prevents clang and GCC from vectorizing the loop; the second one

  if( TEST_FLAG_SWEEP( srcGrid, ACCEL )) {
    ux = 0.005;
    uy = 0.002;
    uz = 0.000;
  }

seems to be if-converted (?) by clang or at least doesn't inhibit vectorization. Now if I comment out the first, larger if, clang does vectorize the loop.
With the return false commented out in the above GCC snippet GCC also vectorizes, but only when both ifs are commented out. Results (with both ifs commented out), -march=native (resulting in avx2), best of 3 as lbm is notoriously fickle:

gcc trunk vanilla:          156.04s
gcc trunk with elementwise: 132.10s
clang 17:                   143.06s

Of course even the comment already said that costing is difficult and the change will surely cause regressions elsewhere. However, the 15% improvement with vectorization (or the 9% improvement of clang) IMHO shows that it's surely useful to look into this further. On top, the riscv clang seems not to care about the first if either and still vectorizes. I haven't looked closely at what happens there, though.
[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607 --- Comment #10 from Robin Dapp --- The compile farm machine I'm using doesn't have SVE. Compiling with -march=armv8-a -O3 pr113607.c -fno-vect-cost-model and running it returns 0 (i.e. ok). pr113607.c:35:5: note: vectorized 3 loops in function.
[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607 --- Comment #7 from Robin Dapp --- Yep, that one fails for me now, thanks.
[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607 --- Comment #4 from Robin Dapp --- I cannot reproduce it either, tried with -ftree-vectorize as well as -fno-vect-cost-model.
[Bug other/113575] [14 Regression] memory hog building insn-opinit.o (i686-linux-gnu -> riscv64-linux-gnu)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113575 --- Comment #14 from Robin Dapp --- Ok, running tests with the adjusted version and going to post a patch afterwards. However, during a recent run compiling insn-recog took 2G and insn-emit-7 as well as insn-emit-10 required > 1.5G each. Looks like they could cause problems as well then? The insn-emit files can be split into 20 instead of 10 which might help but insn-recog I haven't had a look at yet.
[Bug other/113575] [14 Regression] memory hog building insn-opinit.o (i686-linux-gnu -> riscv64-linux-gnu)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113575 --- Comment #12 from Robin Dapp --- Created attachment 57209 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57209&action=edit Tentative

I tested the attached "fix". On my machine with a 13.2 host compiler it reduced the build time for insn-opinit.cc from > 4 mins to < 2 mins and the memory usage from > 1G to 600ish M. I didn't observe 3.5G before, though. For now I just went with an arbitrary threshold of 5000 patterns and splitting into 10 functions. After testing on x86 and aarch64 I realized that both have < 3000 patterns, so right now it would only split riscv's init function. Or should we rather go the other way, i.e. split into fixed-size chunks (of 1000) instead?
[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583 --- Comment #2 from Robin Dapp --- > It's interesting, for Clang only RISC-V can vectorize it. The full loop can be vectorized on clang x86 as well when I remove the first conditional (which is not in the snippet I posted above). So that's likely a different issue than the loop itself.
[Bug tree-optimization/113583] New: Main loop in 519.lbm not vectorized.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583 Bug ID: 113583 Summary: Main loop in 519.lbm not vectorized. Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: rdapp at gcc dot gnu.org Target Milestone: --- Target: x86_64-*-* riscv*-*-*

This might be a known issue but a bugzilla search regarding lbm didn't show anything related. The main loop in SPEC2017 519.lbm GCC riscv does not vectorize while clang does. For x86 neither clang nor GCC seem to vectorize it. A (not entirely minimal but let's start somewhere) example is the following. This one is, however, vectorized by clang-17 x86 and not by GCC trunk x86 or other targets I checked.

#define CST1 (1.0 / 3.0)

typedef enum
{
  C = 0, N, S, E, W, T, B,
  NW, NE, A, BB, CC, D, EE, FF, GG, HH, II, JJ,
  FLAGS, NN
} CELL_ENTRIES;

#define SX 100
#define SY 100
#define SZ 130

#define CALC_INDEX(x, y, z, e) ((e) + NN * ((x) + (y) * SX + (z) * SX * SY))
#define GRID_ENTRY_SWEEP(g, dx, dy, dz, e) ((g)[CALC_INDEX (dx, dy, dz, e) + (i)])
#define LOCAL(g, e) (GRID_ENTRY_SWEEP (g, 0, 0, 0, e))
#define NEIGHBOR_C(g, e) (GRID_ENTRY_SWEEP (g, 0, 0, 0, e))
#define NEIGHBOR_S(g, e) (GRID_ENTRY_SWEEP (g, 0, -1, 0, e))
#define NEIGHBOR_N(g, e) (GRID_ENTRY_SWEEP (g, 0, +1, 0, e))
#define NEIGHBOR_E(g, e) (GRID_ENTRY_SWEEP (g, +1, 0, 0, e))
#define SRC_C(g) (LOCAL (g, C))
#define SRC_N(g) (LOCAL (g, N))
#define SRC_S(g) (LOCAL (g, S))
#define SRC_E(g) (LOCAL (g, E))
#define SRC_W(g) (LOCAL (g, W))
#define DST_C(g) (NEIGHBOR_C (g, C))
#define DST_N(g) (NEIGHBOR_N (g, N))
#define DST_S(g) (NEIGHBOR_S (g, S))
#define DST_E(g) (NEIGHBOR_E (g, E))

typedef double arr[SX * SY * SZ * NN];

#define OMEGA 0.123

void foo (arr src, arr dst)
{
  double ux, uy, u2;
  const double lambda0 = 1.0 / (0.5 + 3.0 / (16.0 * (1.0 / OMEGA - 0.5)));
  double fs[NN], fa[NN], feqs[NN], feqa[NN];
  for (int i = 0; i < SX * SY * SZ * NN; i += NN)
    {
      ux = 1.0;
      uy = 1.0;
      feqs[C] = CST1 * (1.0);
      feqs[N] = feqs[S] = CST1 * (1.0 + 4.5 * (+uy) * (+uy));
      feqa[C] = 0.0;
      feqa[N] = 0.2;
      fs[C] = SRC_C (src);
      fs[N] = fs[S] = 0.5 * (SRC_N (src) + SRC_S (src));
      fa[C] = 0.0;
      fa[N] = 0.1;
      DST_C (dst) = SRC_C (src) - OMEGA * (fs[C] - feqs[C]);
      DST_N (dst) = SRC_N (src) - OMEGA * (fs[N] - feqs[N])
                    - lambda0 * (fa[N] - feqa[N]);
    }
}

missed.c:19:2: note: ==> examining statement: _4 = *_3;
missed.c:19:2: missed: no array mode for V8DF[20]
missed.c:19:2: missed: no array mode for V8DF[20]
missed.c:19:2: missed: the size of the group of accesses is not a power of 2 or not equal to 3
missed.c:19:2: missed: not falling back to elementwise accesses
missed.c:43:11: missed: not vectorized: relevant stmt not supported: _4 = *_3;

Also refer to https://godbolt.org/z/P517qc3Yf for riscv and https://godbolt.org/z/M134KvEEo for aarch64. For aarch64 it seems clang would vectorize the snippet but does not consider it profitable to do so. For riscv and the full lbm workload I roughly see one third the number of dynamically executed qemu instructions with the clang build vs the GCC build, 340 billion vs 1200 billion.
[Bug other/113575] [14 Regression] memory hog building insn-opinit.o (i686-linux-gnu -> riscv64-linux-gnu)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113575 --- Comment #7 from Robin Dapp --- Ok, I'm going to check.
[Bug other/113575] [14 Regression] memory hog building insn-opinit.o (i686-linux-gnu -> riscv64-linux-gnu)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113575 Robin Dapp changed: What|Removed |Added CC||rdapp at gcc dot gnu.org --- Comment #5 from Robin Dapp --- Yes, this is a known issue and it's due to our large number of patterns. Contrary to insn-emit insn-opinit cannot be split that easily. It would probably need a tree-like approach or similar. I wouldn't see this as a regression in the classical sense as we just have many more patterns because of the vector extension. Is increasing the available memory an option in the meantime or does this urgently require fixing?
[Bug target/113570] RISC-V: SPEC2017 549 fotonik3d miscompilation in autovec VLS 256 build
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113570 --- Comment #2 from Robin Dapp --- I'm pretty certain this is "works as intended" and -Ofast causes the precision to be different than with -O3 (and dependent on the target). See also:

It has been reported that with gfortran -Ofast -march=native verification errors may be seen, for example:

*** Miscompare of pscyee.out; for details see /data2/johnh/out.v1.1.5/benchspec/CPU/549.fotonik3d_r/run/run_base_refrate_Ofastnative./pscyee.out.mis
0646: -1.91273086037953E-17, -1.46491401919706E-15, -1.91273086057460E-17, -1.46491401919687E-15, ^
0668: -1.91251317582607E-17, -1.42348205527085E-15, -1.91251317602571E-17, -1.42348205527068E-15, ^

The errors may occur with other compilers as well, depending on your particular compiler version, hardware platform, and optimization options. The problem arises when a compiler chooses to vectorize a particular loop from power.F90 line number 369:

369       do ifreq = 1, tmppower%nofreq
370          frequency(ifreq,ipower) = freq
371          freq = freq + freqstep
372       end do

from https://www.spec.org/cpu2017/Docs/benchmarks/549.fotonik3d_r.html which further states:

Workaround: You will need to specify optimization options that do not cause this loop to be vectorized. For example, on a particular platform studied in mid-2020 using GCC 10.2, these results were seen:

OK -Ofast -march=native -fno-unsafe-math-optimizations

If you apply one of the above workarounds in base, be sure to obey the same-for-all rule which requires that all benchmarks in a suite of a given language must use the same flags. For example, the sections below turn off unsafe math optimizations for all Fortran modules in the floating point rate and floating point speed benchmark suites:

default=base:
  OPTIMIZE = -Ofast -flto -march=native
fprate,fpspeed=base:
  FOPTIMIZE = -fno-unsafe-math-optimizations
[Bug testsuite/113558] [14 regression] gcc.dg/vect/vect-outer-4c-big-array.c etc. FAIL
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113558 --- Comment #2 from Robin Dapp --- Created attachment 57195 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57195&action=edit Tentative patch

Ah, it looks like nothing is being vectorized at all and the second check just happened to match as part of the unsuccessful vectorization attempt. It would seem that we need the same condition as for the first check here as well. Would you mind giving the attached patch a try? I ran it on riscv and power10 so far; x86 and aarch64 are still in progress.
[Bug target/113087] [14] RISC-V rv64gcv vector: Runtime mismatch with rv64gc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113087 --- Comment #38 from Robin Dapp --- deepsjeng also looks ok here.
[Bug target/113087] [14] RISC-V rv64gcv vector: Runtime mismatch with rv64gc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113087 --- Comment #37 from Robin Dapp --- > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113206#c9 > Using 4a0a8dc1b88408222b88e10278017189f6144602, the spec run failed on: > zvl128b (All runtime fails): > 527.cam4 (Runtime) > 531.deepsjeng (Runtime) > 521.wrf (Runtime) > 523.xalancbmk (Runtime) I tried reproducing the xalanc fail first but with the current trunk I don't see a runtime fail. Going to try deepsjeng next.
[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495 --- Comment #27 from Robin Dapp --- Following up on this: I'm seeing the same thing Patrick does. We create a lot of large non-sparse sbitmaps that amount to around 33G in total. I did local experiments replacing all sbitmaps that are not needed for LCM by regular bitmaps. Apart from output differences vs the original version the testsuite is unchanged. As expected, wrf now takes longer to compile, 8 mins vs 4ish mins before, and we still use 2.7G of RAM for this single file (likely because of the remaining sbitmaps) compared to a max of 1.2ish G that the rest of the compilation uses. One possibility to get the best of both worlds would be a threshold based on num_bbs * num_exprs: once we exceed it, switch to the bitmap path, otherwise keep sbitmaps for performance. Messaging with Juzhe offline, his best guess for the LICM time is that he enabled checking for dataflow, which slows down this particular compilation by a lot. Therefore it doesn't look like a generic problem.
[Bug c/113474] RISC-V: Fail to use vmerge.vim for constant vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113474 --- Comment #1 from Robin Dapp --- Good catch. Looks like the ifn expander always forces into a register. That's probably necessary on all targets except riscv.

diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
index a07f25f3aee..e923051d540 100644
--- a/gcc/internal-fn.cc
+++ b/gcc/internal-fn.cc
@@ -3118,7 +3118,8 @@ expand_vec_cond_mask_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
   rtx_op2 = expand_normal (op2);

   mask = force_reg (mask_mode, mask);
-  rtx_op1 = force_reg (mode, rtx_op1);
+  if (!insn_operand_matches (icode, 1, rtx_op1))
+    rtx_op1 = force_reg (mode, rtx_op1);

   rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
   create_output_operand (&ops[0], target, mode);

gives me:

foo:
.LFB0:
        .cfi_startproc
        ble     a0,zero,.L5
        slli    a3,a0,3
        add     a3,a1,a3
        vsetivli        zero,4,e32,m1,ta,ma
        vmv.v.i v3,15
        vmv.v.i v2,0
.L3:
        ld      a5,0(a1)
        addi    a4,a5,4
        addi    a5,a5,20
        vle32.v v1,0(a5)
        vle32.v v0,0(a4)
        vmseq.vv        v0,v0,v3
        vmerge.vim      v4,v2,1,v0
        vse32.v v4,0(a4)
        vmseq.vv        v0,v1,v3
        addi    a1,a1,8
        vmerge.vim      v1,v2,1,v0
        vse32.v v1,0(a5)
        bne     a1,a3,.L3
.L5:
        ret
[Bug target/113247] RISC-V: Performance bug in SHA256 after enabling RVV vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113247 --- Comment #9 from Robin Dapp --- I also noticed this (likely unwanted) vector snippet and wondered where it is being created. First I thought it's a vec_extract but doesn't look like it. I'm going to check why we create this. Pan, the test was on real hardware I suppose? So regardless of the fact that we likely want to get rid of the snippet above, would you mind checking whether generic-ooo has any effect on performance? Maybe you could try -march=rv64gc -mtune=generic-ooo. Thanks.
[Bug middle-end/112971] [14] RISC-V rv64gcv_zvl256b vector -O3: internal compiler error: Segmentation fault signal terminated program cc1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112971 --- Comment #22 from Robin Dapp --- Yes, going to the thread soon.
[Bug target/113249] RISC-V: regression testsuite errors -mtune=generic-ooo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113249 --- Comment #4 from Robin Dapp ---
> One of the reasons I've been testing things with generic-ooo is because
> generic-ooo had initial vector pipelines defined. For cleaning up the
> scheduler, I copied over the generic-ooo pipelines into generic and sifive-7
> md files. As you mentioned, the scan dump fails are likely less optimal code
> sequences for the as a result of the cost model. I'm planning on sending up
> a patch in my series that adds -fno-schedule-insns -fno-schedule-insns2 to
> the dump scan tests that fail but do you think it would be better to hard
> code the tune instead?

It's a bit difficult to say; neither option is ideal, but there is no ideal way anyway :) Disabling scheduling is probably fine for all the intrinsics tests because it can be argued that the expected output is very close to the input anyway. For others it might depend on the intention of the test. But, in order to get them out of the way, I think it should be ok to just disable scheduling and take care of the intention of the tests later.
[Bug target/113247] RISC-V: Performance bug in SHA256 after enabling RVV vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113247 --- Comment #4 from Robin Dapp --- The other option is to assert that all tune models have at least a vector cost model rather than NULL... But not falling back to the builtin costs still makes sense.
[Bug target/113247] RISC-V: Performance bug in SHA256 after enabling RVV vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113247 --- Comment #3 from Robin Dapp --- Yes, sure and I gave a bit of detail why the values chosen there (same as aarch64) make sense to me. Using this generic vector cost model by default without adjusting the latencies is possible. I would be OK with such a change but would also rather not have "rocket" at all by default ;)
[Bug target/113247] RISC-V: Performance bug in SHA256 after enabling RVV vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113247 --- Comment #1 from Robin Dapp --- Hmm, so I tried reproducing this and without a vector cost model we indeed vectorize. My qemu dynamic instruction count results are not as abysmal as yours but still bad enough (20-30% increase in dynamic instructions). However, as soon as I use the vector cost model, enabled by -mtune=generic-ooo, the sha256 function is not vectorized anymore:

bla.c:95:5: note: Cost model analysis for part in loop 0:
  Vector cost: 294
  Scalar cost: 185
bla.c:95:5: missed: not vectorized: vectorization is not profitable.

Without that we have:

bla.c:95:5: note: Cost model analysis for part in loop 0:
  Vector cost: 173
  Scalar cost: 185
bla.c:95:5: note: Basic block will be vectorized using SLP

(Those costs are obtained via default_builtin_vectorization_cost.) The main difference is vec_to_scalar cost being 1 by default and 2 in our cost model, as well as vec_perm = 2. Given our limited permute capabilities I think a cost of 2 makes sense. We can also argue in favor of vec_to_scalar = 2 because we need to slide down elements for extraction and cannot extract directly. Setting scalar_to_vec = 2 is debatable and I'd rather keep it at 1. For the future we need to make a decision whether to continue with generic-ooo as the default vector model or whether we want to set latencies to a few uniform values in order for scheduling not to introduce spilling and waiting for dependencies. To help with that decision you could run some benchmarks with the generic-ooo tuning and see if things get better or worse?
[Bug target/113281] [14] RISC-V rv64gcv_zvl256b vector: Runtime mismatch with rv64gc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113281 --- Comment #2 from Robin Dapp --- Confirmed. Funny, we shouldn't vectorize that but really optimize it to "return 0". Costing might be questionable but we also haven't optimized away the loop when comparing costs. Disregarding that, of course the vectorization should be correct. The vect output doesn't really make sense to me but I haven't looked very closely yet:

_177 = .SELECT_VL (2, POLY_INT_CST [16, 16]);
vect_patt_82.18_166 = (vector([16,16]) unsigned short) { 17, 18, 19, ... };
vect_patt_84.19_168 = MIN_EXPR ;
vect_patt_85.20_170 = { 32872, ... } >> vect_patt_84.19_168;
vect_patt_87.21_171 = VIEW_CONVERT_EXPR(vect_patt_85.20_170);
_173 = _177 + 18446744073709551615;
# RANGE [irange] short int [0, 16436] MASK 0x7fff VALUE 0x0
_174 = .VEC_EXTRACT (vect_patt_87.21_171, _173);

vect_patt_85.20_170 should be all zeros and then we'd just vec_extract a 0 and return that. However, 32872 >> 15 == 1 so we return 1.
[Bug target/113249] RISC-V: regression testsuite errors -mtune=generic-ooo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113249 --- Comment #1 from Robin Dapp --- Yes, several (most?) of those are expected because the tests rely on the default latency model. One option is to hard code the tune in those tests. On the other hand, the dump tests check for a more or less optimal code sequence (under certain conditions and regardless of uarch, of course), and deviation from that sequence might also indicate sub-optimal code. I commented on this a bit when first introducing generic-ooo. If there are new execution failures that would be more concerning and indicate a real bug.
[Bug target/112999] riscv: Infinite loop with mask extraction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112999 Robin Dapp changed: What|Removed |Added Resolution|--- |FIXED Status|UNCONFIRMED |RESOLVED --- Comment #4 from Robin Dapp --- Should be fixed on trunk.
[Bug target/112773] [14 Regression] RISC-V ICE: in force_align_down_and_div, at poly-int.h:1828 on rv32gcv_zvl256b
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112773 --- Comment #16 from Robin Dapp --- I'd assume it was not actually fixed by this but just went latent because we chose a VLS-mode vectorization instead. Hopefully we're better off with the fix than without :)
[Bug target/113014] RISC-V: Redundant zeroing instructions in reduction due to r14-3998-g6223ea766daf7c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113014 --- Comment #4 from Robin Dapp --- Richard has posted it and asked for reviews. I have tested it and we have several testsuite regressions with it but no severe ones. Most or all of them are dump fails because we combine into vx variants that would be vv variants before. I replied to Richard's post mentioning that we would very much like to see that go in because it helps us generate the code we want. To me it appears very likely that it will land.
[Bug target/113014] RISC-V: Redundant zeroing instructions in reduction due to r14-3998-g6223ea766daf7c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113014 --- Comment #2 from Robin Dapp --- Yes, that's right.
[Bug target/112999] riscv: Infinite loop with mask extraction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112999 --- Comment #1 from Robin Dapp --- What actually gets in the way of vec_extract here is changing to a "better" vector mode (which is RVVMF4QI here). If we tried to extract from the mask directly everything would just work. I have a patch locally that does this by refactoring extract_bit_field_1 slightly. Going to post it soon but not sure if people agree with that idea.
[Bug target/112999] New: riscv: Infinite loop with mask extraction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112999 Bug ID: 112999 Summary: riscv: Infinite loop with mask extraction Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: rdapp at gcc dot gnu.org CC: juzhe.zhong at rivai dot ai, pan2.li at intel dot com Target Milestone: --- Target: riscv

Pan Li found the following problematic case in his "full-coverage" testing and I'm just documenting it here for reference.

/* { dg-do compile } */
/* { dg-options "-march=rv64gcv_zvl512b -mabi=lp64d --param=riscv-autovec-lmul=m8 --param=riscv-autovec-preference=fixed-vlmax -O3 -fno-vect-cost-model -fno-tree-loop-distribute-patterns" } */

int a[1024];
int b[1024];

_Bool fn1 ()
{
  _Bool tem;
  for (int i = 0; i < 1024; ++i)
    {
      tem = !a[i];
      b[i] = tem;
    }
  return tem;
}

We try to extract the last bit from a 128-bit value of a mask vector. In order to do so we first subreg by a tieable vector mode (here RVVMF4QI), then, because we do not have a RVVMF4QI -> BI vector extraction, try type punning with a TImode subreg. As we do not natively support TImode, the result needs to be subreg'd again to DImode. In the course of doing so we get lost in subreg moves and hit an infinite loop. I have not tracked down the real root cause but the problem is fixed by providing a movti pattern and special casing subreg:TI extraction from vectors (just like we do in legitimize_move for other scalar subregs of vectors - and which I don't particularly like either :) ).
[Bug middle-end/112971] [14] RISC-V rv64gcv_zvl256b vector -O3: internal compiler error: Segmentation fault signal terminated program cc1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112971 --- Comment #8 from Robin Dapp --- Yes, can confirm that this helps.
[Bug target/112971] [14] RISC-V rv64gcv_zvl256b vector -O3: internal compiler error: Segmentation fault signal terminated program cc1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112971 --- Comment #5 from Robin Dapp --- Yes that's what I just tried. No infinite loop anymore then. But that's not a new simplification and looks reasonable so there must be something special for our backend.
[Bug target/112971] [14] RISC-V rv64gcv_zvl256b vector -O3: internal compiler error: Segmentation fault signal terminated program cc1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112971 --- Comment #3 from Robin Dapp --- In match.pd we do something like this:

;; Function e (e, funcdef_no=0, decl_uid=2751, cgraph_uid=1, symbol_order=4)

Pass statistics of "forwprop":

Matching expression match.pd:2771, gimple-match-2.cc:35
Matching expression match.pd:2774, gimple-match-1.cc:66
Matching expression match.pd:2781, gimple-match-2.cc:96
Aborting expression simplification due to deep recursion
Aborting expression simplification due to deep recursion
Applying pattern match.pd:6784, gimple-match-5.cc:1742
Applying pattern match.pd:6784, gimple-match-5.cc:1742
Applying pattern match.pd:6784, gimple-match-5.cc:1742
Applying pattern match.pd:6784, gimple-match-5.cc:1742
Applying pattern match.pd:6784, gimple-match-5.cc:1742
Applying pattern match.pd:6784, gimple-match-5.cc:1742
Applying pattern match.pd:6784, gimple-match-5.cc:1742
Applying pattern match.pd:6784, gimple-match-5.cc:1742
Applying pattern match.pd:6784, gimple-match-5.cc:1742
Applying pattern match.pd:6784, gimple-match-5.cc:1742
Applying pattern match.pd:6784, gimple-match-5.cc:1742
gimple_simplified to _53 = { 0, ... } & { 8, 7, 6, ... };
_63 = { 0, ... } & { -9, -8, -7, ... };
_52 = { 0, ... } & { 8, 7, 6, ... };
_74 = { 0, ... } & { -9, -8, -7, ... };
_38 = { 0, ... } & { 8, 7, 6, ... };
_40 = { 0, ... } & { -9, -8, -7, ... };
_55 = { 0, ... } & { 8, 7, 6, ... };
_57 = { 0, ... } & { -9, -8, -7, ... };
_65 = { 0, ... } & { 8, 7, 6, ... };
_72 = { 0, ... } & { -9, -8, -7, ... };
_32 = { 0, ... } & { 8, 7, 6, ... };
mask__6.19_61 = _32 == { 0, ... };

That doesn't look particularly backend related but we're trying to simplify a mask so you never know...
[Bug target/112971] [14] RISC-V rv64gcv_zvl256b vector -O3: internal compiler error: Segmentation fault signal terminated program cc1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112971 --- Comment #2 from Robin Dapp --- It doesn't look like the same issue to me. The other bug is related to TImode handling in combination with mask registers. I will also have a look at this one.
[Bug target/112929] [14] RISC-V vector: Variable clobbered at runtime
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112929 --- Comment #15 from Robin Dapp --- I think we need to make sure that we're not writing out of bounds. Once we do, anything might happen: even if we don't happen to overwrite this particular variable we might clobber another one, and the test can still pass "by accident". If my analysis is correct (it was only done very quickly), the vl should be 32 at that point and we should not write past that size. We could have printf output a larger chunk of memory. Maybe that way we could see whether something was clobbered even with the newer qemu.
[Bug target/112853] RISC-V: RVV: SPEC2017 525.x264 regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112853 --- Comment #10 from Robin Dapp --- I just realized that I forgot to post the comparison recently. With the patch now upstream I don't see any differences for zvl128b and different vlens anymore. What I haven't fully tested yet is zvl256b or higher with various vlens.
[Bug target/112929] [14] RISC-V vector: Variable clobbered at runtime
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112929 --- Comment #13 from Robin Dapp --- I just built from the most recent commit and it still fails for me. Could there be a difference in qemu? I'm on qemu-riscv64 version 8.1.91, but yours is even newer so that might not explain it. You could step through until the last vsetvl before the printf and check the vl after it (or the avl in a4). As we overwrite the stack, this might lead to different outcomes in different environments.
[Bug target/112929] [14] RISC-V vector: Variable clobbered at runtime
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112929 --- Comment #9 from Robin Dapp --- In the good version the length is 32 here because directly before the vsetvl we have: li a4,32 That seems to get lost somehow.
[Bug target/112929] [14] RISC-V vector: Variable clobbered at runtime
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112929 --- Comment #7 from Robin Dapp --- Here 0x105c6 vse8.v v8,(a5) is where we overwrite m. The vl is 128, but the preceding vsetvl gets a4 = 46912504507016 as AVL, which seems already broken.
[Bug target/112929] [14] RISC-V vector: Variable clobbered at runtime
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112929 --- Comment #6 from Robin Dapp --- This seems to be gone when simple vsetvl (instead of lazy) is used, or with -fno-schedule-insns, which might indicate a vsetvl pass problem. We might have a few more of those. Maybe it would make sense to run the testsuite with an RVV-enabled valgrind. But that might give more false positives than real findings :/
[Bug target/112853] RISC-V: RVV: SPEC2017 525.x264 regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112853 --- Comment #8 from Robin Dapp --- With Juzhe's latest fix that disables VLS modes >= 128 bit for zvl128b x264 runs without issues here and some of the additional execution failures are gone. Will post the current comparison later.
[Bug middle-end/112872] [14 Regression] RISCV ICE: in store_integral_bit_field, at expmed.cc:1049 with -03 rv64gcv_zvl1024b --param=riscv-autovec-preference=fixed-vlmax
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112872 --- Comment #2 from Robin Dapp --- Thanks. Yes that's similar and also looks fixed by the introduction of the vec_init expander. Added this test case to the patch and will push it soon.
[Bug target/112853] RISC-V: RVV: SPEC2017 525.x264 regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112853 --- Comment #7 from Robin Dapp --- Ah, forgot three tests:

FAIL: gcc.dg/vect/bb-slp-cond-1.c execution test
FAIL: gcc.dg/vect/bb-slp-pr101668.c -flto -ffat-lto-objects execution test
FAIL: gcc.dg/vect/bb-slp-pr101668.c execution test

On vlen=512

gfortran.dg/array_constructor_4.f90
gfortran.dg/vector_subscript_8.f90
gfortran.fortran-torture/execute/in-pack.f90

are gone again; the rest is similar. Are those the unstable ones?
[Bug target/112853] RISC-V: RVV: SPEC2017 525.x264 regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112853 --- Comment #6 from Robin Dapp --- I indeed see more failures with _zvl128b, vlen=256 (than with _zvl128b, vlen=128):

FAIL: gcc.dg/vect/pr66251.c -flto -ffat-lto-objects execution test
FAIL: gcc.dg/vect/pr66251.c execution test
FAIL: gcc.dg/vect/pr66253.c -flto -ffat-lto-objects execution test
FAIL: gcc.dg/vect/pr66253.c execution test
FAIL: gcc.dg/vect/slp-46.c -flto -ffat-lto-objects execution test
FAIL: gcc.dg/vect/slp-46.c execution test
FAIL: gcc.dg/vect/vect-alias-check-10.c -flto -ffat-lto-objects execution test
FAIL: gcc.dg/vect/vect-alias-check-10.c execution test
FAIL: gcc.dg/vect/vect-alias-check-11.c -flto -ffat-lto-objects execution test
FAIL: gcc.dg/vect/vect-alias-check-11.c execution test
FAIL: gcc.dg/vect/vect-alias-check-12.c -flto -ffat-lto-objects execution test
FAIL: gcc.dg/vect/vect-alias-check-12.c execution test
FAIL: gcc.dg/vect/vect-alias-check-18.c -flto -ffat-lto-objects execution test
FAIL: gcc.dg/vect/vect-alias-check-18.c execution test
FAIL: gfortran.dg/array_constructor_4.f90 -O1 execution test
FAIL: gfortran.dg/associate_18.f08 -O1 execution test
FAIL: gfortran.dg/vector_subscript_8.f90 -O1 execution test
FAIL: gfortran.dg/vector_subscript_8.f90 -O2 execution test
FAIL: gfortran.dg/vector_subscript_8.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions execution test
FAIL: gfortran.dg/vector_subscript_8.f90 -O3 -g execution test
FAIL: gfortran.fortran-torture/execute/in-pack.f90 execution, -O1
FAIL: gfortran.fortran-torture/execute/in-pack.f90 execution, -O2
FAIL: gfortran.fortran-torture/execute/in-pack.f90 execution, -O2 -fbounds-check
FAIL: gfortran.fortran-torture/execute/in-pack.f90 execution, -O2 -fomit-frame-pointer -finline-functions
FAIL: gfortran.fortran-torture/execute/in-pack.f90 execution, -O2 -fomit-frame-pointer -finline-functions -funroll-loops
FAIL: gfortran.fortran-torture/execute/in-pack.f90 execution, -O3 -g

Maybe those can give a hint.