[Bug c/115104] RISC-V: GCC-14 can combine vsext+vadd -> vwadd but Trunk GCC (GCC 15) Failed

2024-05-15 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115104

--- Comment #2 from Robin Dapp  ---
Thanks, I was just about to open a PR.

[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.

2024-05-13 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #18 from Robin Dapp  ---
A bit of a follow-up:  I'm working on a patch for reassociation that can handle
the mentioned cases and some more, but it will still take a bit of time to
get everything correct and regression-free.  What it does is allow reassoc to
look through constant multiplications and negates to provide more freedom in
the optimization process.

Regarding the mentioned element-wise costing, how should we proceed here?  I'm
going to remove the hunk in question, run SPEC2017 on x86, and post a patch in
order to get some data as a basis for discussion.

[Bug middle-end/114196] [13 Regression] Fixed length vector ICE: in vect_peel_nonlinear_iv_init, at tree-vect-loop.cc:9454

2024-05-13 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114196

--- Comment #7 from Robin Dapp  ---
I can barely build a compiler on gcc185 due to disk space.  I'm going to set up
a cross toolchain (that I need for other purposes as well) in order to test.

[Bug target/114734] [14] RISC-V rv64gcv_zvl256b miscompile with -flto -O3 -mrvv-vector-bits=zvl

2024-04-25 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114734

--- Comment #10 from Robin Dapp  ---
Yes, it helps.  Funny that get_gimple_for_ssa_name sits right below
get_rtx_for_ssa_name, which I stepped through several times while debugging
without ever realizing the connection.

But thanks!  Good thing it can be solved like that.

I cannot do a bootstrap/regtest for aarch64 because cfarm185 is almost out of
disk space.  As the bug is old and very unlikely to trigger it can surely wait
for GCC15?

[Bug target/114734] [14] RISC-V rv64gcv_zvl256b miscompile with -flto -O3 -mrvv-vector-bits=zvl

2024-04-25 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114734

--- Comment #8 from Robin Dapp  ---
Created attachment 58037
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58037&action=edit
Expand dump

Dump attached.  Insn 209 is the problematic one.
The changing from _911 to 1078 happens in internal-fn.cc:expand_call_mem_ref
(and not via TER).
The lookup there is simple and I was also wondering if there is some
single_imm_use or so missing.

[Bug target/114734] [14] RISC-V rv64gcv_zvl256b miscompile with -flto -O3 -mrvv-vector-bits=zvl

2024-04-24 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114734

Robin Dapp  changed:

   What|Removed |Added

 CC||rguenth at gcc dot gnu.org,
   ||rsandifo at gcc dot gnu.org

--- Comment #6 from Robin Dapp  ---
This one is really a bit tricky.

We have the following situation:

loop:

# vectp_g.178_1078 = PHI  
_911 =  vectp_g.178_1078
MASK_LEN_LOAD (_911, ...);
vectp_g.178_1079 = vectp_g.178_1078 + 16;
goto loop;

:
MASK_LEN_LOAD (_911, ...);

During expand we basically convert back the _911 to vectp_g.178_1078 (reverting
what we did in ivopts before).  Because _911 camouflages vectp_g.178_1078 until
expand we evaded the conflict checks of outof-ssa that would catch a similar,
non-camouflaged situation like:

# vectp_g.178_1078 = PHI  
MASK_LEN_LOAD (MEM... vectp_g.178_1078, ...);
vectp_g.178_1079 = vectp_g.178_1078 + 16;
goto loop;
MASK_LEN_LOAD (MEM... vectp_g.178_1078, ...);

and would insert a copy of the definition right before the backedge.  The
MASK_LEN_LOAD after the loop would then use that copy.  By using _911 instead
of the original pointer no conflict is detected and we wrongly use the
incremented pointer.  Without the ivopts change for TARGET_MEM_REF we would
not get into this situation.

Unless I'm misunderstanding some basic mechanism, it's not going to work like
that (and we could also hit this situation on aarch64).  What could help is to
enhance trivially_conflicts_p in out-of-ssa to catch such TARGET_MEM_REFs and
handle them like a normal conflict.  I did that locally and it helps for this
particular case, but I'd rather not post it in its current hacky state even if
the riscv testsuite looks OK :)  Even if that were the correct solution, I
doubt it should land in stage 4.

CC'ing Richard Sandiford as he originally introduced the ivopts and expand
handling.

[Bug target/114734] [14] RISC-V rv64gcv_zvl256b miscompile with -flto -O3 -mrvv-vector-bits=zvl

2024-04-22 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114734

--- Comment #5 from Robin Dapp  ---
What happens is that code sinking does:

Sinking # VUSE <.MEM_1235>
vect__173.251_1238 = .MASK_LEN_LOAD (_911, 32B, { -1, -1, -1, -1 },
loop_len_1064, 0);
 from bb 3 to bb 4

so we have

vect__173.251_1238 = .MASK_LEN_LOAD (_911, 32B, { -1, -1, -1, -1 },
loop_len_1064, 0);

after the loop.

When expanding this stmt expand_call_mem_ref creates a mem reference to
vectp_g.178 for _911 (== vectp_g.178_1078).  This is expanded to the same rtl
as vectp_g.178_1079 (which is incremented before the latch as opposed to
...1078 which is not).

Disabling sinking or expand_call_mem_ref both help but neither is correct of
course :)  I don't have a solution yet but I'd hope we're a bit closer to the
problem now.

[Bug target/114714] [RISC-V][RVV] ICE: insn does not satisfy its constraints (postreload)

2024-04-22 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114714

Robin Dapp  changed:

   What|Removed |Added

 CC||rdapp at gcc dot gnu.org

--- Comment #5 from Robin Dapp  ---
Did anybody do some further investigation here?  Juzhe messaged me that this PR
is the original reason for the reversal but I don't yet understand why the
register filters don't encompass the full semantics of RVV overlap.

I looked into the test case and what happens is that, in order to determine the
validity of the alternatives, riscv_get_v_regno_alignment is first being called
with an M2 mode.  Our destination is actually a (subreg:RVVM2SI (reg:RVVM4SI
...) 0), though.  I suppose lra/reload check whether a non-subreg destination
also works and hands us a (reg:RVVM4SI ...) as operand[0].  We pass this to
riscv_get_v_regno_alignment which, for an LMUL4 mode, returns 4, thus wrongly
enabling the W42 alternatives.
A W42 alternative permits hard regs % 4 == 2, which causes us to eventually
choose vr2 as destination and source.  Once the constraints are actually
checked we have a mismatch as none of the alternatives work.

Now I'm not at all sure how lra/reload use operand[0] here but this can surely
be found out.  A quick and dirty hack (attached) that checks the insn's
destination mode instead of operand[0]'s mode gets rid of the ICE and doesn't
cause regressions.

I suppose we're too far along with the reversal already, but I'd really have
preferred more details.  Maybe somebody has had an in-depth look but it just
wasn't posted yet?

--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -6034,6 +6034,22 @@ riscv_get_v_regno_alignment (machine_mode mode)
   return lmul;
 }

+int
+riscv_get_dest_alignment (rtx_insn *insn, rtx operand)
+{
+  const_rtx set = 0;
+  if (GET_CODE (PATTERN (insn)) == SET)
+    {
+      set = PATTERN (insn);
+      rtx op = SET_DEST (set);
+      return riscv_get_v_regno_alignment (GET_MODE (op));
+    }
+  else
+    {
+      return riscv_get_v_regno_alignment (GET_MODE (operand));
+    }
+}
+
 /* Define ASM_OUTPUT_OPCODE to do anything special before
emitting an opcode.  */
 const char *
diff --git a/gcc/config/riscv/riscv.md b/gcc/config/riscv/riscv.md
index ce1ee6b9c5e..5113daf2ac7 100644
--- a/gcc/config/riscv/riscv.md
+++ b/gcc/config/riscv/riscv.md
@@ -550,15 +550,15 @@ (define_attr "group_overlap_valid" "no,yes"
      (const_string "yes")

      (and (eq_attr "group_overlap" "W21")
-	  (match_test "riscv_get_v_regno_alignment (GET_MODE (operands[0])) != 2"))
+	  (match_test "riscv_get_dest_alignment (insn, operands[0]) != 2"))
	 (const_string "no")

      (and (eq_attr "group_overlap" "W42")
-	  (match_test "riscv_get_v_regno_alignment (GET_MODE (operands[0])) != 4"))
+	  (match_test "riscv_get_dest_alignment (insn, operands[0]) != 4"))
	 (const_string "no")

      (and (eq_attr "group_overlap" "W84")
-	  (match_test "riscv_get_v_regno_alignment (GET_MODE (operands[0])) != 8"))
+	  (match_test "riscv_get_dest_alignment (insn, operands[0]) != 8"))
	 (const_string "no")

[Bug target/114734] [14] RISC-V rv64gcv_zvl256b miscompile with -flto -O3 -mrvv-vector-bits=zvl

2024-04-16 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114734

--- Comment #4 from Robin Dapp  ---
Ok, it looks like we do 5 iterations with the last one being length-masked to
length 2 and then in the "live extraction" phase use "iteration 6".

[Bug target/114734] [14] RISC-V rv64gcv_zvl256b miscompile with -flto -O3 -mrvv-vector-bits=zvl

2024-04-16 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114734

--- Comment #3 from Robin Dapp  ---
> probably -fwhole-program is enough, -flto not needed(?)

Yes, -fwhole-program is sufficient.

> 
>   # vectp_g.248_1401 = PHI 
> ...
>   _1411 = .SELECT_VL (ivtmp_1409, POLY_INT_CST [2, 2]);
> ..
>   vect__193.250_1403 = .MASK_LEN_LOAD (vectp_g.248_1401, 32B, { -1, ... },
> _1411, 0);
>   vect__194.251_1404 = -vect__193.250_1403;
>   vect_iftmp.252_1405 = (vector([2,2]) long int) vect__194.251_1404;
> 
>   # vect_iftmp.252_1406 = PHI 
>   # loop_len_1427 = PHI <_1411(5)>
> ...
>   _1407 = loop_len_1427 + 18446744073709551615;
>   _1408 = .VEC_EXTRACT (vect_iftmp.252_1406, _1407);
>   iftmp.3_1204 = _1408;
> 
> is stored to b[15].  Doesn't look too odd to me.

At the assembly equivalent of

>   vect__193.250_1403 = .MASK_LEN_LOAD (vectp_g.248_1401, 32B, { -1, ... },
> _1411, 0); 

we load [3 3] (=f) instead of [0 0] (=g).  f is located after g in memory and
register a3 is increased before the loop latch.  We then re-use a3 to load the
last two elements of g but actually read the first two of f.

[Bug target/114734] [14] RISC-V rv64gcv_zvl256b miscompile with -flto -O3 -mrvv-vector-bits=zvl

2024-04-16 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114734

--- Comment #1 from Robin Dapp  ---
Confirmed.

[Bug middle-end/114733] [14] Miscompile with -march=rv64gcv -O3 on riscv

2024-04-16 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114733

--- Comment #1 from Robin Dapp  ---
Confirmed, also shows up here.

[Bug target/114665] [14] RISC-V rv64gcv: miscompile at -O3

2024-04-15 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114665

--- Comment #5 from Robin Dapp  ---
Weird,  I tried your exact qemu version and still can't reproduce the problem.  
My results are always FFB5.

Binutils difference?  Very unlikely.  Could you post your QEMU_CPU settings
just to be sure?

[Bug target/114668] [14] RISC-V rv64gcv: miscompile at -O3

2024-04-15 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114668

Robin Dapp  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #4 from Robin Dapp  ---
I didn't have the time to fully investigate but the default path without vec
extract is definitely broken for masks.  I'd probably sleep better if we fixed
that at some point but for now the obvious fix is to add the missing expanders.

Patrick, I'm still unable to reproduce PR114665 (maybe also a qemu
difference?).  Could you re-check with this fix?  Thanks.

[Bug target/114686] Feature request: Dynamic LMUL should be the default for the RISC-V Vector extension

2024-04-15 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114686

--- Comment #3 from Robin Dapp  ---
I think we have always maintained that this can definitely be a per-uarch
default but shouldn't be a generic default.

> I don't see any reason why this wouldn't be the case for the vast majority of
> implementations, especially high performance ones would benefit from having
> more work to saturate the execution units with, since a larger LMUL works
> quite
> similar to loop unrolling.

One argument is reduced freedom for renaming and the out-of-order machinery.
It's much easier to shuffle individual registers around than large blocks.
Also, lower-latency insns are easier to schedule than longer-latency ones, and
faults, rejects, aborts etc. get proportionally more expensive.
I was under the impression that unrolling doesn't help a whole lot on modern
cores (sometimes it even slows things down a bit) and is certainly not
unconditionally helpful.  Granted, I haven't seen a lot of data on it recently.
An exception, of course, is breaking dependency chains.

In general nothing stands in the way of having a particular tune target use
dynamic LMUL by default even now but nobody went ahead and posted a patch for
theirs.  One could maybe argue that it should be the default for in-order
uarchs?

Should it become obvious in the future that LMUL > 1 is indeed,
unconditionally, a "better unrolling" because of its favorable icache footprint
and other properties (which I doubt - happy to be proved wrong), then we will
surely re-evaluate the decision, or rather a different consensus will emerge.

The data we publicly have so far is all in-order cores and my expectation is
that the picture will change once out-of-order cores hit the scene.

[Bug target/114668] [14] RISC-V rv64gcv: miscompile at -O3

2024-04-10 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114668

--- Comment #2 from Robin Dapp  ---
This, again, seems to be a problem with bit extraction from masks.
For some reason I didn't add the VLS modes to the corresponding vec_extract
patterns.  With those in place the problem is gone because we go through the
expander which does the right thing.

I'm still checking what exactly goes wrong without those as there is likely a
latent bug.

[Bug target/114665] [14] RISC-V rv64gcv: miscompile at -O3

2024-04-10 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114665

--- Comment #2 from Robin Dapp  ---
Checked with the latest commit on a different machine but still cannot
reproduce the error.  PR114668 I can reproduce.  Maybe a copy and paste
problem?

[Bug target/114665] [14] RISC-V rv64gcv: miscompile at -O3

2024-04-10 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114665

--- Comment #1 from Robin Dapp  ---
Hmm, my local version is a bit older and seems to give the same result for both
-O2 and -O3.  At least a good starting point for bisection then.

[Bug ipa/114247] RISC-V: miscompile at -O3 and IPA SRA

2024-04-04 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114247

--- Comment #6 from Robin Dapp  ---
Testsuite looks unchanged on rv64gcv.

[Bug ipa/114247] RISC-V: miscompile at -O3 and IPA SRA

2024-04-04 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114247

--- Comment #5 from Robin Dapp  ---
This fixes the test case for me locally, thanks.
I can run the testsuite with it later if you'd like.

[Bug tree-optimization/114476] [13/14 Regression] wrong code with -fwrapv -O3 -fno-vect-cost-model (and -march=armv9-a+sve2 on aarch64 and -march=rv64gcv on riscv)

2024-04-03 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114476

--- Comment #8 from Robin Dapp  ---
I tried some things (for the related bug without -fwrapv) then got busy with
some other things.  I'm going to have another look later this week.

[Bug rtl-optimization/114515] [14 Regression] Failure to use aarch64 lane forms after PR101523

2024-04-02 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114515

Robin Dapp  changed:

   What|Removed |Added

 CC||ewlu at rivosinc dot com,
   ||rdapp at gcc dot gnu.org

--- Comment #7 from Robin Dapp  ---
There is some riscv fallout as well.  Edwin has the details.

[Bug tree-optimization/114485] [13/14 Regression] Wrong code with -O3 -march=rv64gcv on riscv or `-O3 -march=armv9-a` for aarch64

2024-03-27 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114485

--- Comment #4 from Robin Dapp  ---
Yes, the vectorization looks ok.  The extracted live values are not used
afterwards and therefore the whole vectorized loop is being thrown away.
Then we do one iteration of the epilogue loop, inverting the original c and end
up with -8 instead of 8.  This is pretty similar to what's happening in the
related PR.

We properly populate the phi in question in slpeel_update_phi_nodes_for_guard1:

c_lsm.7_64 = PHI <_56(23), pretmp_34(17)>

but vect_update_ivs_after_vectorizer changes that into

c_lsm.7_64 = PHI .

Just as a test, commenting out

  if (!LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo))
vect_update_ivs_after_vectorizer (loop_vinfo, niters_vector_mult_vf,
  update_e);

at least makes us keep the VEC_EXTRACT and not fail anymore.

[Bug tree-optimization/114476] [13/14 Regression] wrong code with -fwrapv -O3 -fno-vect-cost-model (and -march=armv9-a+sve2 on aarch64 and -march=rv64gcv on riscv)

2024-03-26 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114476

--- Comment #5 from Robin Dapp  ---
So the result is -9 instead of 9 (or vice versa) and this happens (just) with
vectorization.  We only vectorize with -fwrapv.

From a first quick look, the following is what we have before vect:

(loop)
   [local count: 991171080]:
  ...
  # b_lsm.5_5 = PHI <_4(7), b_lsm.5_17(2)>
  ...
  _4 = -b_lsm.5_5;
(check)
 [local count: 82570744]:
  ...
  # b_lsm.5_22 = PHI 
  ...
  if (b_lsm.5_22 != -9)

I.e. b gets negated with every iteration and we check the second to last
against -9.

With vectorization we have:
(init)
   [local count: 82570744]:
  b_lsm.5_17 = b;

(vectorized loop)
   [local count: 247712231]:
  ...
  # b_lsm.5_5 = PHI <_4(7), b_lsm.5_17(2)>
  ...
_4 = -b_lsm.5_5;
  ...
  goto 

(epilogue)
   [local count: 82570741]:
  ...
  # b_lsm.5_7 = PHI <_25(11), b_lsm.5_17(13)>
  ...
  _25 = -b_lsm.5_7;

(check)
   [local count: 82570744]:
  ...
  # b_lsm.5_22 = PHI 
  if (b_lsm.5_22 != -9)

What looks odd here is that b_lsm.5_7's fallthrough argument is b_lsm.5_17 even
though we must have come through the vectorized loop (which negated b at least
once).  This makes us skip inversions.
Indeed, as b_lsm.5_22 is only dependent on the initial value of b it gets
optimized away and we compare b != -9.

Maybe I missed something but it looks like
  # b_lsm.5_7 = PHI <_25(11), b_lsm.5_17(13)>
should have b_lsm.5_5 or _4 as fallthrough argument.

[Bug tree-optimization/114396] [14 Regression] Vector: Runtime mismatch at -O2 with -fwrapv

2024-03-20 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396

--- Comment #8 from Robin Dapp  ---
No fallout on x86 or aarch64.

Of course using false instead of TYPE_SIGN (utype) is also possible and maybe
clearer?

[Bug tree-optimization/114396] [14 Regression] Vector: Runtime mismatch at -O2 with -fwrapv

2024-03-19 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396

--- Comment #7 from Robin Dapp  ---
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 4375ebdcb49..f8f7ba0ccc1 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -9454,7 +9454,7 @@ vect_peel_nonlinear_iv_init (gimple_seq* stmts, tree init_expr,
 	wi::to_mpz (skipn, exp, UNSIGNED);
 	mpz_ui_pow_ui (mod, 2, TYPE_PRECISION (type));
 	mpz_powm (res, base, exp, mod);
-	begin = wi::from_mpz (type, res, TYPE_SIGN (type));
+	begin = wi::from_mpz (type, res, TYPE_SIGN (utype));
 	tree mult_expr = wide_int_to_tree (utype, begin);
 	init_expr = gimple_build (stmts, MULT_EXPR, utype,
 				  init_expr, mult_expr);

This helps for the test case.

[Bug target/114396] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3 with -fwrapv

2024-03-19 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396

--- Comment #3 from Robin Dapp  ---
-O3 -mavx2 -fno-vect-cost-model -fwrapv seems to be sufficient.

[Bug target/114396] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3 with -fwrapv

2024-03-19 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396

Robin Dapp  changed:

   What|Removed |Added

 Target|riscv*-*-*  |x86_64-*-* riscv*-*-*

--- Comment #2 from Robin Dapp  ---
At first glance it doesn't really look like a target issue.

Tried it on x86 and it fails as well with
-O3 -march=native pr114396.c -fno-vect-cost-model -fwrapv

short a = 0xF;
short b[16];

int main() {
for (int e = 0; e < 9; e += 1)
b[e] = a *= 0x5;

if (a != 2283)
__builtin_abort ();
}

[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)

2024-03-15 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548

--- Comment #29 from Robin Dapp  ---
Yes, that also appears to work here.  There was no lto involved this time?
Now we need to figure out what's different with SPEC.

[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)

2024-03-15 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548

--- Comment #27 from Robin Dapp  ---
Can you try it with a simpler (non SPEC) test?  Maybe there is still something
weird happening with SPEC's scripting.

[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)

2024-03-14 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548

--- Comment #24 from Robin Dapp  ---
I rebuilt GCC from scratch with your options but still have the same problem. 
Could our sources differ?  My SPEC version might not be the most recent but I'm
not aware that mcf changed at some point.

Just to be sure: I'm using r14-5075-gc05f748218a0d5 as the "before" commit.

[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)

2024-03-14 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548

--- Comment #22 from Robin Dapp  ---
Still the same problem unfortunately.

I'm a bit out of ideas - maybe your compiler executables could help?

[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)

2024-03-14 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548

--- Comment #20 from Robin Dapp  ---
No change with -std=gnu99 unfortunately.

[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)

2024-03-14 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548

--- Comment #18 from Robin Dapp  ---
Hmm, doesn't help unfortunately.  A full command line for me looks like:

x86_64-pc-linux-gnu-gcc -c -o pbeampp.o -DSPEC_CPU -DNDEBUG -DWANT_STDC_PROTO
-Ofast -march=znver4 -mtune=znver4 -flto=32 -g -fprofile-use=/tmp
-DSPEC_CPU_LP64 pbeampp.c

Could you verify if it's exactly the same for you?  Maybe it would also help if
you explicitly specified znver4?

[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)

2024-03-14 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548

--- Comment #16 from Robin Dapp  ---
Thank you!

I'm having a problem with the data, though.
Compiling with -Ofast -march=znver4 -mtune=znver4 -flto -fprofile-use=/tmp.
Would you mind showing your exact final options for the compilation of e.g.
pbeampp.c?

I see, similar-ish for both commits:
pbeampp.c:119:8: error: number of counters in profile data for function
'primal_bea_mpp' does not match its profile data (counter 'arcs', expected 20
and have 22) [-Werror=coverage-mismatch]

output.c:87:1: error: corrupted profile info: number of executions for edge 3-4
thought to be 1
output.c:87:1: error: corrupted profile info: number of executions for edge 3-5
thought to be -1
output.c:87:1: error: corrupted profile info: number of iterations for basic
block 5 thought to be -1

[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)

2024-03-13 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548

--- Comment #10 from Robin Dapp  ---
(In reply to Sam James from comment #9)
> (In reply to Filip Kastl from comment #8)
> > I'd like to help but I'm afraid I cannot send you the SPEC binaries with PGO
> > applied since SPEC is licensed nor can I give you access to a Zen4 computer.
> > I suppose someone else will have to analyze this bug.
> 
> Could you perhaps send only the gcda files so Robin can build again with
> -fprofile-use?

Yes, that would be helpful.

Or Filip builds the executables himself and posts (some of) the difference
here.  Maybe that also gets us a bit closer to the problem.

[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)

2024-03-08 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548

--- Comment #7 from Robin Dapp  ---
I built executables with and without the commit (-Ofast -march=znver4 -flto). 
There is no difference so it must really be something that happens with PGO.
I'd really need access to a zen4 box or the pgo executables at least.

[Bug target/114202] [14] RISC-V rv64gcv: miscompile at -O3

2024-03-06 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114202

Robin Dapp  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |DUPLICATE

--- Comment #3 from Robin Dapp  ---
Same as PR114200.

*** This bug has been marked as a duplicate of bug 114200 ***

[Bug target/114200] [14] RISC-V fixed-length vector miscompile at -O3

2024-03-06 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114200

--- Comment #3 from Robin Dapp  ---
*** Bug 114202 has been marked as a duplicate of this bug. ***

[Bug middle-end/114196] [13/14 Regression] Fixed length vector ICE: in vect_peel_nonlinear_iv_init, at tree-vect-loop.cc:9454

2024-03-06 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114196

Robin Dapp  changed:

   What|Removed |Added

   See Also||https://gcc.gnu.org/bugzill
   ||a/show_bug.cgi?id=113163

--- Comment #2 from Robin Dapp  ---
To me this looks like it already came up in the context of early-break
vectorization (PR113163) but is not actually dependent on it.  I'm testing a
patch that disables epilogue peeling also without early break.

[Bug target/114200] [14] RISC-V fixed-length vector miscompile at -O3

2024-03-06 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114200

--- Comment #1 from Robin Dapp  ---
Took me a while to analyze this... it needed more time than I'd like to admit
to make sense of the somewhat weird code created by full unrolling and peeling.

I believe the problem is that we reload the output register of a vfmacc/fma via
vmv.v.v (subject to length masking) but we should be using vmv1r.v.  The result
is used by a reduction which always operates on the full length.  As annoying
as it was to find - it's definitely a good catch.

I'm testing a patch.  PR114202 is indeed a duplicate.  Going to add its test
case to the patch.

[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)

2024-03-04 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548

--- Comment #6 from Robin Dapp  ---
Honestly, I don't know how to analyze/debug this without a zen4, in particular
as it only seems to happen with PGO.  I tried locally but of course the
execution time doesn't change (same as with zen3 according to the database).
Is there a way to obtain the binaries in order to tell a difference?

[Bug middle-end/114109] x264 satd vectorization vs LLVM

2024-02-26 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114109

--- Comment #4 from Robin Dapp  ---
Yes, as mentioned, vectorization of the first loop is debatable.

[Bug middle-end/114109] x264 satd vectorization vs LLVM

2024-02-26 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114109

--- Comment #2 from Robin Dapp  ---
It is vectorized with a higher zvl, e.g. zvl512b, refer
https://godbolt.org/z/vbfjYn5Kd.

[Bug middle-end/114109] New: x264 satd vectorization vs LLVM

2024-02-26 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114109

Bug ID: 114109
   Summary: x264 satd vectorization vs LLVM
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: enhancement
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rdapp at gcc dot gnu.org
CC: juzhe.zhong at rivai dot ai, law at gcc dot gnu.org
  Target Milestone: ---
Target: x86_64-*-* riscv*-*-*

Looking at the following code of x264 (SPEC 2017):

typedef unsigned char uint8_t;
typedef unsigned short uint16_t;
typedef unsigned int uint32_t;

static inline uint32_t abs2 (uint32_t a)
{
uint32_t s = ((a >> 15) & 0x10001) * 0xffff;
return (a + s) ^ s;
}

int x264_pixel_satd_8x4 (uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2)
{
uint32_t tmp[4][4];
uint32_t a0, a1, a2, a3;
int sum = 0;

for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 )
{
a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
{
  int t0 = a0 + a1;
  int t1 = a0 - a1;
  int t2 = a2 + a3;
  int t3 = a2 - a3;
  tmp[i][0] = t0 + t2;
  tmp[i][1] = t1 + t3;
  tmp[i][2] = t0 - t2;
  tmp[i][3] = t1 - t3;
};
}
for( int i = 0; i < 4; i++ )
{
{ int t0 = tmp[0][i] + tmp[1][i];
  int t1 = tmp[0][i] - tmp[1][i];
  int t2 = tmp[2][i] + tmp[3][i];
  int t3 = tmp[2][i] - tmp[3][i];
  a0 = t0 + t2;
  a2 = t0 - t2;
  a1 = t1 + t3;
  a3 = t1 - t3;
};
sum += abs2 (a0) + abs2 (a1) + abs2 (a2) + abs2 (a3);
}
return (((uint16_t) sum) + ((uint32_t) sum >> 16)) >> 1;
}

I first checked on riscv but x86 and aarch64 are pretty similar.  (Refer
https://godbolt.org/z/vzf5ha44r that compares at -O3 -mavx512f)

Vectorizing the first loop seems to be a costing issue.  By default we don't
vectorize and the code becomes much larger when disabling vector costing, so
the costing decision in itself seems correct.
Clang's version is significantly shorter and it looks like it just directly
vec_sets/vec_inits the individual elements.  On riscv it can be handled rather
elegantly with strided loads that we don't emit right now.
As there are only 4 active vector elements and the loop is likely load bound it
might be debatable whether LLVM's version is better?

The second loop we do vectorize (4 elements at a time) but end up with e.g.
four XORs for the four inlined abs2 calls while clang chooses a larger
vectorization factor and does all the xors in one.

On my laptop (no avx512) I don't see a huge difference (113s GCC vs 108s LLVM)
but I guess the general case is still interesting?

[Bug target/114028] [14] RISC-V rv64gcv_zvl256b: miscompile at -O3

2024-02-22 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114028

--- Comment #2 from Robin Dapp  ---
This is a target issue.  It looks like we try to construct a "superword"
sequence when the element size is already == Pmode.  Testing a patch.

[Bug target/114027] [14] RISC-V vector: miscompile at -O3

2024-02-22 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114027

--- Comment #9 from Robin Dapp  ---
Argh,  I actually just did a gcc -O3 -march=native pr114027.c
-fno-vect-cost-model on cfarm188 with a recent-ish GCC but realized that I used
my slightly modified version and not the original test case.

long a;
int b[10][8] = {{},
{},
{},
{},
{},
{},
{0, 0, 0, 0, 0, 1, 1},
{1, 1, 1, 1, 1, 1, 1},
{1, 1, 1, 1, 1, 1, 1}};
int c;
int main() {
int d;
c = 0x;
for (; a < 6; a++) {
d = 0;
for (; d < 6; d++) {
c ^= -3L;
if (b[a + 3][d])
continue;
c = 0;
}
}

if (c == -3) {
return 0;
} else {
return 1;
}
}

This was from an initial attempt to minimize it further but I didn't really
verify if I'm breaking the test case by that (or causing undefined behavior).

With that I get a "1" with default options and "0" with -fno-tree-vectorize.
Maybe my snippet is broken then?

[Bug target/114027] [14] RISC-V vector: miscompile at -O3

2024-02-22 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114027

Robin Dapp  changed:

   What|Removed |Added

 CC||rguenth at gcc dot gnu.org
   Last reconfirmed||2024-2-22
 Target|riscv   |x86_64-*-* riscv*-*-*
   ||aarch64-*-*

--- Comment #5 from Robin Dapp  ---
To me it looks like we interpret e.g. c_53 = _43 ? prephitmp_13 : 0 as the only
reduction statement in the chain and simplify to MAX based on that wrong
assumption, when we actually have several.
(See "condition expression based on compile time constant").

--- Comment #6 from Robin Dapp  ---
Btw this fails on x86 and aarch64 for me with -fno-vect-cost-model.  So it
definitely looks generic.

[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)

2024-02-13 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548

--- Comment #4 from Robin Dapp  ---
Judging by the graph it looks like it was slow before, then got faster and now
slower again.  Is there some more info on why it got faster in the first place?
 Did the patch reverse something or is it rather a secondary effect?  I don't
have a zen4 handy to check.

[Bug target/113827] MrBayes benchmark redundant load on riscv

2024-02-08 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113827

--- Comment #1 from Robin Dapp  ---
x86 (-march=native -O3 on an i7 12th gen) looks pretty similar:

.L3:
        movq    (%rdi), %rax
        vmovups (%rax), %xmm1
        vdivps  %xmm0, %xmm1, %xmm1
        vmovups %xmm1, (%rax)
        addq    $16, %rax
        movq    %rax, (%rdi)
        addq    $8, %rdi
        cmpq    %rdi, %rdx
        jne     .L3

So probably not target specific.  Costing?

[Bug target/113827] New: MrBayes benchmark redundant load

2024-02-08 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113827

Bug ID: 113827
   Summary: MrBayes benchmark redundant load
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rdapp at gcc dot gnu.org
CC: juzhe.zhong at rivai dot ai, law at gcc dot gnu.org,
pan2.li at intel dot com
Blocks: 79704
  Target Milestone: ---
Target: riscv

A hot block in the MrBayes benchmark (as used in the Phoronix testsuite) has a
redundant scalar load when vectorized.

Minimal example, compiled with -march=rv64gcv -O3

int foo (float **a, float f, int n)
{
  for (int i = 0; i < n; i++)
{
  a[i][0] /= f;
  a[i][1] /= f;
  a[i][2] /= f;
  a[i][3] /= f;
  a[i] += 4;
}
}

GCC:
.L3:
        ld      a5,0(a0)
        vle32.v v1,0(a5)
        vfmul.vv        v1,v1,v2
        vse32.v v1,0(a5)
        addi    a5,a5,16
        sd      a5,0(a0)
        addi    a0,a0,8
        bne     a0,a4,.L3

The value of a5 doesn't change after the store to 0(a0).

LLVM:
.L3:
        vle32.v v8,(a1)
        addi    a3,a1,16
        sd      a3,0(a2)
        vfdiv.vf        v8,v8,fa5
        addi    a2,a2,8
        vse32.v v8,(a1)
        bne     a2,a0,.L3


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79704
[Bug 79704] [meta-bug] Phoronix Test Suite compiler performance issues

[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3

2024-01-31 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607

--- Comment #23 from Robin Dapp  ---
> this is:
> 
> _429 = mask_patt_205.47_276[i] ? vect_cst__262[i] : (vect_cst__262 <<
> {0,..})[i];
> vect_iftmp.55_287 = mask_patt_209.54_286[i] ? _429 [i] : vect_cst__262[i]

But isn't it rather
_429 = mask_patt_205.47_276[i] ? (vect_cst__262[i] << vect_cst__262[i]) :
{0,..}[i]?

The else should be the last operand, shouldn't it?

On aarch64 we don't seem to emit a COND_SHL therefore this particular situation
does not occur.

However the simplification was introduced for aarch64:

(for cond_op (COND_BINARY)
 (simplify
  (vec_cond @0
   (cond_op:s @1 @2 @3 @4) @3)
  (cond_op (bit_and @1 @0) @2 @3 @4)))

It is supposed to simplify (in gcc.target/aarch64/sve/pre_cond_share_1.c)

  _256 = .COND_MUL (mask__108.48_193, vect_iftmp.45_187, vect_cst__190, { 0.0,
... });
  vect_prephitmp_151.50_197 = VEC_COND_EXPR ;

into COND_MUL (mask108 & mask101, vect_iftmp.45_187, vect_cst__190, { 0.0, ...
});

But that doesn't look valid to me either.  No matter what _256 is, the result
for !mask101 should be vect_cst__190 and not 0.0.

[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3

2024-01-30 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607

--- Comment #19 from Robin Dapp  ---
What seems odd to me is that in fre5 we simplify

  _429 = .COND_SHL (mask_patt_205.47_276, vect_cst__262, vect_cst__262, { 0,
... });
  vect_prephitmp_129.51_282 = _429;
  vect_iftmp.55_287 = VEC_COND_EXPR ;

to

Applying pattern match.pd:9607, gimple-match-10.cc:3817
gimple_simplified to vect_iftmp.55_287 = .COND_SHL (mask_patt_205.47_276,
vect_cst__262, vect_cst__262, { 0, ... });

so fold

vec_cond (mask209, prephitmp129, vect_cst262)
with prephitmp129 = cond_shl (mask205, vect_cst262, vect_cst262, 0)

into
cond_shl = (mask205, vect_cst262, vect_cst262, 0)?

That doesn't look valid to me because the vec_cond's else value (vect_cst262)
gets lost.  Wouldn't such a simplification have a conditional else value?
Like !mask1 ? else1 : else2 instead of else2 unconditionally?

[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3

2024-01-29 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607

--- Comment #18 from Robin Dapp  ---
Hehe no it doesn't make sense...  I wrongly read a v2 as a v1.  Please
disregard the last message.

[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3

2024-01-29 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607

--- Comment #17 from Robin Dapp  ---
Grasping for straws by blaming qemu ;)

At some point we do the vector shift

vsll.vv v1,v2,v2,v0.t

but the mask v0 is all zeros:
gdb:
   b = {0 }

According to the mask-undisturbed policy set before
vsetvli zero,zero,e32,mf2,ta,mu

all elements should be unchanged.  I'm seeing an all-zeros result in v1,
though.
v1 is used as 'j', is zero and therefore 'q' is not incremented and we don't
assign c = d causing the wrong result.

Before the shift I see v2 in gdb as:
  w = {4294967295, 4294967295, 0, 0}
(That's also a bit dubious because we load 2 elements from 'g' of which only
one should be -1.  This doesn't change the end result, though.)

After the shift gdb shows v1 as:
   w = {0, 0, 0, 0},

when it should be w = {-1, -1, 0, 0}.

Does this make sense?

[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3

2024-01-29 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607

--- Comment #16 from Robin Dapp  ---
Disabling vec_extract makes us operate on non-partial vectors, though so there
are a lot of differences in codegen.  I'm going to have a look.

[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.

2024-01-26 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #9 from Robin Dapp  ---
(In reply to rguent...@suse.de from comment #6)

> t.c:47:21: missed:   the size of the group of accesses is not a power of 2 
> or not equal to 3
> t.c:47:21: missed:   not falling back to elementwise accesses
> t.c:58:15: missed:   not vectorized: relevant stmt not supported: _4 = 
> *_3;
> t.c:47:21: missed:  bad operation or unsupported loop bound.
> 
> where we don't consider using gather because we have a known constant
> stride (20).  Since the stores are really scatters we don't attempt
> to SLP either.
> 
> Disabling the above heuristic we get this vectorized as well, avoiding
> gather/scatter by manually implementing them and using a quite high
> VF of 8 (with -mprefer-vector-width=256 you get VF 4 and likely
> faster code in the end).

I suppose you're referring to this?

  /* FIXME: At the moment the cost model seems to underestimate the
 cost of using elementwise accesses.  This check preserves the
 traditional behavior until that can be fixed.  */
  stmt_vec_info first_stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info);
  if (!first_stmt_info)
first_stmt_info = stmt_info;
  if (*memory_access_type == VMAT_ELEMENTWISE
      && !STMT_VINFO_STRIDED_P (first_stmt_info)
      && !(stmt_info == DR_GROUP_FIRST_ELEMENT (stmt_info)
           && !DR_GROUP_NEXT_ELEMENT (stmt_info)
           && !pow2p_hwi (DR_GROUP_SIZE (stmt_info))))
    {
      if (dump_enabled_p ())
        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                         "not falling back to elementwise accesses\n");
      return false;
    }


I did some more tests on my laptop.  As said above the whole loop in lbm is
larger and contains two ifs.  The first one prevents clang and GCC from
vectorizing the loop, the second one

if( TEST_FLAG_SWEEP( srcGrid, ACCEL )) {
ux = 0.005;
uy = 0.002;
uz = 0.000;
}

seems to be if-converted? by clang or at least doesn't inhibit vectorization.

Now if I comment out the first, larger if clang does vectorize the loop.  With
the return false commented out in the above GCC snippet GCC also vectorizes,
but only when both ifs are commented out.

Results (with both ifs commented out), -march=native (resulting in avx2), best
of 3 as lbm is notoriously fickle:

gcc trunk vanilla: 156.04s
gcc trunk with elementwise: 132.10s
clang 17: 143.06s

Of course even the comment already said that costing is difficult and the
change will surely cause regressions elsewhere.  However the 15% improvement
with vectorization (or the 9% improvement of clang) IMHO show that it's surely
useful to look into this further.  On top, the riscv clang seems to not care
about the first if either and still vectorize.  I haven't looked closer what
happens there, though.

[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3

2024-01-26 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607

--- Comment #10 from Robin Dapp  ---
The compile farm machine I'm using doesn't have SVE.
Compiling with -march=armv8-a -O3 pr113607.c -fno-vect-cost-model and running
it returns 0 (i.e. ok).

pr113607.c:35:5: note: vectorized 3 loops in function.

[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3

2024-01-26 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607

--- Comment #7 from Robin Dapp  ---
Yep, that one fails for me now, thanks.

[Bug target/113607] [14] RISC-V rv64gcv vector: Runtime mismatch at -O3

2024-01-25 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113607

--- Comment #4 from Robin Dapp  ---
I cannot reproduce it either, tried with -ftree-vectorize as well as
-fno-vect-cost-model.

[Bug other/113575] [14 Regression] memory hog building insn-opinit.o (i686-linux-gnu -> riscv64-linux-gnu)

2024-01-25 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113575

--- Comment #14 from Robin Dapp  ---
Ok, running tests with the adjusted version and going to post a patch
afterwards.

However, during a recent run compiling insn-recog took 2G and insn-emit-7 as
well as insn-emit-10 required > 1.5G each.  Looks like they could cause
problems as well then?  The insn-emit files can be split into 20 instead of 10
which might help but insn-recog I haven't had a look at yet.

[Bug other/113575] [14 Regression] memory hog building insn-opinit.o (i686-linux-gnu -> riscv64-linux-gnu)

2024-01-24 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113575

--- Comment #12 from Robin Dapp  ---
Created attachment 57209
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57209&action=edit
Tentative

I tested the attached "fix".  On my machine with 13.2 host compiler it reduced
the build time for insn-opinit.cc from > 4 mins to < 2 mins and the memory
usage from >1G to 600ish M.  I didn't observe 3.5G before, though.

For now I just went with an arbitrary threshold of 5000 patterns and splitting
into 10 functions.  After testing on x86 and aarch64 I realized that both have
<3000 patterns so right now it would only split riscv's init function.

Or rather the other way, i.e. splitting into fixed-size chunks (of 1000)
instead?

[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.

2024-01-24 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #2 from Robin Dapp  ---
> It's interesting, for Clang only RISC-V can vectorize it.

The full loop can be vectorized on clang x86 as well when I remove the first
conditional (which is not in the snippet I posted above).  So that's likely a
different issue than the loop itself.

[Bug tree-optimization/113583] New: Main loop in 519.lbm not vectorized.

2024-01-24 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

Bug ID: 113583
   Summary: Main loop in 519.lbm not vectorized.
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rdapp at gcc dot gnu.org
  Target Milestone: ---
Target: x86_64-*-* riscv*-*-*

This might be a known issue but a bugzilla search regarding lbm didn't show
anything related.

The main loop in SPEC2017 519.lbm GCC riscv does not vectorize while clang
does.  For x86 neither clang nor GCC seem to vectorize it.

A (not entirely minimal but let's start somewhere) example is the following. 
This one is, however, vectorized by clang-17 x86 and not by GCC trunk x86 or
other targets I checked.

#define CST1 (1.0 / 3.0)

typedef enum
{
  C = 0,
  N, S, E, W, T, B, NW,
  NE, A, BB, CC, D, EE, FF, GG,
  HH, II, JJ, FLAGS, NN
} CELL_ENTRIES;

#define SX 100
#define SY 100
#define SZ 130

#define CALC_INDEX(x, y, z, e) ((e) + NN * ((x) + (y) * SX + (z) * SX * SY))

#define GRID_ENTRY_SWEEP(g, dx, dy, dz, e) ((g)[CALC_INDEX (dx, dy, dz, e) + (i)])

#define LOCAL(g, e) (GRID_ENTRY_SWEEP (g, 0, 0, 0, e))
#define NEIGHBOR_C(g, e) (GRID_ENTRY_SWEEP (g, 0, 0, 0, e))
#define NEIGHBOR_S(g, e) (GRID_ENTRY_SWEEP (g, 0, -1, 0, e))
#define NEIGHBOR_N(g, e) (GRID_ENTRY_SWEEP (g, 0, +1, 0, e))
#define NEIGHBOR_E(g, e) (GRID_ENTRY_SWEEP (g, +1, 0, 0, e))

#define SRC_C(g) (LOCAL (g, C))
#define SRC_N(g) (LOCAL (g, N))
#define SRC_S(g) (LOCAL (g, S))
#define SRC_E(g) (LOCAL (g, E))
#define SRC_W(g) (LOCAL (g, W))

#define DST_C(g) (NEIGHBOR_C (g, C))
#define DST_N(g) (NEIGHBOR_N (g, N))
#define DST_S(g) (NEIGHBOR_S (g, S))
#define DST_E(g) (NEIGHBOR_E (g, E))

typedef double arr[SX * SY * SZ * NN];

#define OMEGA 0.123

void
foo (arr src, arr dst)
{
  double ux, uy, u2;
  const double lambda0 = 1.0 / (0.5 + 3.0 / (16.0 * (1.0 / OMEGA - 0.5)));
  double fs[NN], fa[NN], feqs[NN], feqa[NN];

  for (int i = 0; i < SX * SY * SZ * NN; i += NN)
{
  ux = 1.0;
  uy = 1.0;

  feqs[C] = CST1 * (1.0);
  feqs[N] = feqs[S] = CST1 * (1.0 + 4.5 * (+uy) * (+uy));

  feqa[C] = 0.0;
  feqa[N] = 0.2;

  fs[C] = SRC_C (src);
  fs[N] = fs[S] = 0.5 * (SRC_N (src) + SRC_S (src));

  fa[C] = 0.0;
  fa[N] = 0.1;

  DST_C (dst) = SRC_C (src) - OMEGA * (fs[C] - feqs[C]);
      DST_N (dst)
        = SRC_N (src) - OMEGA * (fs[N] - feqs[N]) - lambda0 * (fa[N] - feqa[N]);
}
}



missed.c:19:2: note:   ==> examining statement: _4 = *_3;
missed.c:19:2: missed:   no array mode for V8DF[20]
missed.c:19:2: missed:   no array mode for V8DF[20]
missed.c:19:2: missed:   the size of the group of accesses is not a power of 2
or not equal to 3
missed.c:19:2: missed:   not falling back to elementwise accesses
missed.c:43:11: missed:   not vectorized: relevant stmt not supported: _4 =
*_3;


Also refer to https://godbolt.org/z/P517qc3Yf for riscv and
https://godbolt.org/z/M134KvEEo for aarch64.  For aarch64 it seems clang would
vectorize the snippet but does not consider it profitable to do so.

For riscv and the full lbm workload I roughly see one third the number of
dynamically executed qemu instructions with the clang build vs GCC build, 340
billion vs 1200 billion.

[Bug other/113575] [14 Regression] memory hog building insn-opinit.o (i686-linux-gnu -> riscv64-linux-gnu)

2024-01-24 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113575

--- Comment #7 from Robin Dapp  ---
Ok, I'm going to check.

[Bug other/113575] [14 Regression] memory hog building insn-opinit.o (i686-linux-gnu -> riscv64-linux-gnu)

2024-01-24 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113575

Robin Dapp  changed:

   What|Removed |Added

 CC||rdapp at gcc dot gnu.org

--- Comment #5 from Robin Dapp  ---
Yes, this is a known issue and it's due to our large number of patterns. 
Contrary to insn-emit, insn-opinit cannot be split that easily.  It would
probably need a tree-like approach or similar.
I wouldn't see this as a regression in the classical sense as we just have many
more patterns because of the vector extension.
Is increasing the available memory an option in the meantime or does this
urgently require fixing?

[Bug target/113570] RISC-V: SPEC2017 549 fotonik3d miscompilation in autovec VLS 256 build

2024-01-23 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113570

--- Comment #2 from Robin Dapp  ---
I'm pretty certain this is "works as intended" and -Ofast causes the precision
to be different than with -O3 (and dependant on the target).  See also:


It has been reported that with gfortran -Ofast -march=native verification
errors may be seen, for example:


*** Miscompare of pscyee.out; for details see
   
/data2/johnh/out.v1.1.5/benchspec/CPU/549.fotonik3d_r/run/run_base_refrate_Ofastnative./pscyee.out.mis
0646:   -1.91273086037953E-17, -1.46491401919706E-15,
-1.91273086057460E-17, -1.46491401919687E-15,
^
0668:   -1.91251317582607E-17, -1.42348205527085E-15,
-1.91251317602571E-17, -1.42348205527068E-15,
^

The errors may occur with other compilers as well, depending on your particular
compiler version, hardware platform, and optimization options.

The problem arises when a compiler chooses to vectorize a particular loop from
power.F90 line number 369

369   do ifreq = 1, tmppower%nofreq
370 frequency(ifreq,ipower) = freq
371 freq = freq + freqstep
372   end do



from https://www.spec.org/cpu2017/Docs/benchmarks/549.fotonik3d_r.html
which further states:


Workaround: You will need to specify optimization options that do not cause
this loop to be vectorized. For example, on a particular platform studied in
mid-2020 using GCC 10.2, these results were seen:

OK -Ofast -march=native -fno-unsafe-math-optimizations 

If you apply one of the above workarounds in base, be sure to obey the
same-for-all rule which requires that all benchmarks in a suite of a given
language must use the same flags. For example, the sections below turn off
unsafe math optimizations for all Fortran modules in the floating point rate
and floating point speed benchmark suites:

default=base: 
  OPTIMIZE   = -Ofast -flto -march=native 
fprate,fpspeed=base:
  FOPTIMIZE  = -fno-unsafe-math-optimizations

[Bug testsuite/113558] [14 regression] gcc.dg/vect/vect-outer-4c-big-array.c etc. FAIL

2024-01-23 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113558

--- Comment #2 from Robin Dapp  ---
Created attachment 57195
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57195&action=edit
Tentative patch

Ah, it looks like nothing is being vectorized at all and the second check just
happened to match as part of the unsuccessful vectorization attempt.  It would
seem that we need the same condition as for the first check as well.

Would you mind giving the attached patch a try?  I ran it on riscv and power10
so far, x86 and aarch64 are still in progress.

[Bug target/113087] [14] RISC-V rv64gcv vector: Runtime mismatch with rv64gc

2024-01-22 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113087

--- Comment #38 from Robin Dapp  ---
deepsjeng also looks ok here.

[Bug target/113087] [14] RISC-V rv64gcv vector: Runtime mismatch with rv64gc

2024-01-22 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113087

--- Comment #37 from Robin Dapp  ---
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113206#c9
> Using 4a0a8dc1b88408222b88e10278017189f6144602, the spec run failed on:
> zvl128b (All runtime fails):
> 527.cam4 (Runtime)
> 531.deepsjeng (Runtime)
> 521.wrf (Runtime)
> 523.xalancbmk (Runtime)

I tried reproducing the xalanc fail first but with the current trunk I don't
see a runtime fail.  Going to try deepsjeng next.

[Bug rtl-optimization/113495] RISC-V: Time and memory awful consumption of SPEC2017 wrf benchmark

2024-01-22 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113495

--- Comment #27 from Robin Dapp  ---
Following up on this:

I'm seeing the same thing Patrick does.  We create a lot of large non-sparse
sbitmaps that amount to around 33G in total.

I did local experiments replacing all sbitmaps that are not needed for LCM by
regular bitmaps.  Apart from output differences vs the original version the
testsuite is unchanged.

As expected, wrf now takes longer to compile, 8 mins vs 4ish mins before, and
we still use 2.7G of RAM for this single file (likely because of the remaining
sbitmaps) compared to a max of 1.2ish G that the rest of the compilation uses.

One possibility to get the best of both worlds would be to threshold based on
num_bbs * num_exprs.  Once we exceed it switch to the bitmap pass, otherwise
keep sbitmaps for performance. 

Messaging with Juzhe offline, his best guess for the LICM time is that he
enabled checking for dataflow which slows down this particular compilation by a
lot.  Therefore it doesn't look like a generic problem.

[Bug c/113474] RISC-V: Fail to use vmerge.vim for constant vector

2024-01-18 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113474

--- Comment #1 from Robin Dapp  ---
Good catch.  Looks like the ifn expander always forces into a register.  That's
probably necessary on all targets except riscv.

diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
index a07f25f3aee..e923051d540 100644
--- a/gcc/internal-fn.cc
+++ b/gcc/internal-fn.cc
@@ -3118,7 +3118,8 @@ expand_vec_cond_mask_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
   rtx_op2 = expand_normal (op2);
 
   mask = force_reg (mask_mode, mask);
-  rtx_op1 = force_reg (mode, rtx_op1);
+  if (!insn_operand_matches (icode, 1, rtx_op1))
+    rtx_op1 = force_reg (mode, rtx_op1);
 
   rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
   create_output_operand (&ops[0], target, mode);

gives me:

foo:
.LFB0:
        .cfi_startproc
        ble     a0,zero,.L5
        slli    a3,a0,3
        add     a3,a1,a3
        vsetivli        zero,4,e32,m1,ta,ma
        vmv.v.i v3,15
        vmv.v.i v2,0
.L3:
        ld      a5,0(a1)
        addi    a4,a5,4
        addi    a5,a5,20
        vle32.v v1,0(a5)
        vle32.v v0,0(a4)
        vmseq.vv        v0,v0,v3
        vmerge.vim      v4,v2,1,v0
        vse32.v v4,0(a4)
        vmseq.vv        v0,v1,v3
        addi    a1,a1,8
        vmerge.vim      v1,v2,1,v0
        vse32.v v1,0(a5)
        bne     a1,a3,.L3
.L5:
        ret

[Bug target/113247] RISC-V: Performance bug in SHA256 after enabling RVV vectorization

2024-01-10 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113247

--- Comment #9 from Robin Dapp  ---
I also noticed this (likely unwanted) vector snippet and wondered where it is
being created.  First I thought it's a vec_extract but doesn't look like it. 
I'm going to check why we create this.

Pan, the test was on real hardware I suppose?  So regardless of the fact that
we likely want to get rid of the snippet above, would you mind checking whether
generic-ooo has any effect on performance?  Maybe you could try -march=rv64gc
-mtune=generic-ooo.  Thanks.

[Bug middle-end/112971] [14] RISC-V rv64gcv_zvl256b vector -O3: internal compiler error: Segmentation fault signal terminated program cc1

2024-01-10 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112971

--- Comment #22 from Robin Dapp  ---
Yes,  going to the thread soon.

[Bug target/113249] RISC-V: regression testsuite errors -mtune=generic-ooo

2024-01-09 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113249

--- Comment #4 from Robin Dapp  ---
> One of the reasons I've been testing things with generic-ooo is because
> generic-ooo had initial vector pipelines defined. For cleaning up the
> scheduler, I copied over the generic-ooo pipelines into generic and sifive-7
> md files. As you mentioned, the scan dump fails are likely less optimal code
> sequences for the as a result of the cost model. I'm planning on sending up
> a patch in my series that adds -fno-schedule-insns -fno-schedule-insns2 to
> the dump scan tests that fail but do you think it would be better to hard
> code the tune instead?

It's a bit difficult to say; actually neither option is ideal, but there is no
ideal way anyway :)

Disabling scheduling is probably fine for all the intrinsics tests because it
can be argued that the expected output is very close to the input anyway.

For others it might depend on the intention of the test.  But, in order to get
them out of the way, I think it should be ok to just disabling scheduling and
take care of the intention of the test later.

[Bug target/113247] RISC-V: Performance bug in SHA256 after enabling RVV vectorization

2024-01-09 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113247

--- Comment #4 from Robin Dapp  ---
The other option is to assert that all tune models have at least a vector cost
model rather than NULL...  But not falling back to the builtin costs still
makes sense.

[Bug target/113247] RISC-V: Performance bug in SHA256 after enabling RVV vectorization

2024-01-09 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113247

--- Comment #3 from Robin Dapp  ---
Yes, sure and I gave a bit of detail why the values chosen there (same as
aarch64) make sense to me.

Using this generic vector cost model by default without adjusting the latencies
is possible.  I would be OK with such a change but would also rather not have
"rocket" at all by default ;)

[Bug target/113247] RISC-V: Performance bug in SHA256 after enabling RVV vectorization

2024-01-09 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113247

--- Comment #1 from Robin Dapp  ---
Hmm, so I tried reproducing this and without a vector cost model we indeed
vectorize.  My qemu dynamic instruction count results are not as abysmal as
yours but still bad enough (20-30% increase in dynamic instructions).

However, as soon as I use the vector cost model, enabled by -mtune=generic-ooo,
the sha256 function is not vectorized anymore:

bla.c:95:5: note: Cost model analysis for part in loop 0:
  Vector cost: 294
  Scalar cost: 185
bla.c:95:5: missed: not vectorized: vectorization is not profitable.

Without that we have:
bla.c:95:5: note: Cost model analysis for part in loop 0:
  Vector cost: 173
  Scalar cost: 185
bla.c:95:5: note: Basic block will be vectorized using SLP

(Those costs are obtained via default_builtin_vectorization_cost).

The main difference is vec_to_scalar cost being 1 by default and 2 in our cost
model, as well as vec_perm = 2.  Given our limited permute capabilities I think
a cost of 2 makes sense.  We can also argue in favor of vec_to_scalar = 2
because we need to slide down elements for extraction and cannot extract
directly.  Setting scalar_to_vec = 2 is debatable and I'd rather keep it at 1.

For the future we need to make a decision whether to continue with generic-ooo
as the default vector model or if we want to set latencies to a few uniform
values in order for scheduling not to introduce spilling and waiting for
dependencies.

To help with that decision you could run some benchmarks with the generic-ooo
tuning and see if things get better or worse?

[Bug target/113281] [14] RISC-V rv64gcv_zvl256b vector: Runtime mismatch with rv64gc

2024-01-08 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113281

--- Comment #2 from Robin Dapp  ---
Confirmed.  Funny, we shouldn't vectorize that but really optimize to "return
0".  Costing might be questionable but we also haven't optimized away the loop
when comparing costs.

Disregarding that, of course the vectorization should be correct.

The vect output doesn't really make sense to me but I haven't looked very
closely yet:

  _177 = .SELECT_VL (2, POLY_INT_CST [16, 16]);
  vect_patt_82.18_166 = (vector([16,16]) unsigned short) { 17, 18, 19, ... };
  vect_patt_84.19_168 = MIN_EXPR ;
  vect_patt_85.20_170 = { 32872, ... } >> vect_patt_84.19_168;
  vect_patt_87.21_171 = VIEW_CONVERT_EXPR(vect_patt_85.20_170);
  _173 = _177 + 18446744073709551615;
  # RANGE [irange] short int [0, 16436] MASK 0x7fff VALUE 0x0
  _174 = .VEC_EXTRACT (vect_patt_87.21_171, _173);

vect_patt_85.20_170 should be all zeros and then we'd just vec_extract a 0 and
return that.  However, 32872 >> 15 == 1 so we return 1.

[Bug target/113249] RISC-V: regression testsuite errors -mtune=generic-ooo

2024-01-08 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113249

--- Comment #1 from Robin Dapp  ---
Yes, several (most?) of those are expected because the tests rely on the
default latency model.  One option is to hard code the tune in those tests.
On the other hand the dump tests checking for a more or less optimal code
sequence (under certain conditions and regardless of uarch of course) and
deviation from that sequence might also indicate sub-optimal code.  I commented
on this a bit when first introducing generic-ooo.

If there are new execution failures that would be more concerning and indicate
a real bug.

[Bug target/112999] riscv: Infinite loop with mask extraction

2023-12-15 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112999

Robin Dapp  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #4 from Robin Dapp  ---
Should be fixed on trunk.

[Bug target/112773] [14 Regression] RISC-V ICE: in force_align_down_and_div, at poly-int.h:1828 on rv32gcv_zvl256b

2023-12-14 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112773

--- Comment #16 from Robin Dapp  ---
I'd hope it was not fixed by this but just latent because we chose a VLS-mode
vectorization instead.  Hopefully we're better off with the fix than without :)

[Bug target/113014] RISC-V: Redundant zeroing instructions in reduction due to r14-3998-g6223ea766daf7c

2023-12-14 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113014

--- Comment #4 from Robin Dapp  ---
Richard has posted it and asked for reviews.  I have tested it and we have
several testsuite regressions with it but no severe ones.  Most or all of them
are dump failures because we now combine into vx variants that were vv variants
before.
I replied to Richard's post mentioning that we would very much like to see that
go in because it helps us generate the code we want.
To me it appears very likely that it will land.

[Bug target/113014] RISC-V: Redundant zeroing instructions in reduction due to r14-3998-g6223ea766daf7c

2023-12-14 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113014

--- Comment #2 from Robin Dapp  ---
Yes, that's right.

[Bug target/112999] riscv: Infinite loop with mask extraction

2023-12-13 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112999

--- Comment #1 from Robin Dapp  ---
What actually gets in the way of vec_extract here is the change to a "better"
vector mode (RVVMF4QI in this case).  If we tried to extract from the mask
directly, everything would just work.

I have a patch locally that does this by refactoring extract_bit_field_1
slightly.  Going to post it soon but not sure if people agree with that idea.

[Bug target/112999] New: riscv: Infinite loop with mask extraction

2023-12-13 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112999

Bug ID: 112999
   Summary: riscv: Infinite loop with mask extraction
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rdapp at gcc dot gnu.org
CC: juzhe.zhong at rivai dot ai, pan2.li at intel dot com
  Target Milestone: ---
Target: riscv

Pan Li found the following problematic case in his "full-coverage" testing and
I'm just documenting it here for reference.

/* { dg-do compile } */
/* { dg-options "-march=rv64gcv_zvl512b -mabi=lp64d
--param=riscv-autovec-lmul=m8 --param=riscv-autovec-preference=fixed-vlmax -O3
-fno-vect-cost-model -fno-tree-loop-distribute-patterns" } */

int a[1024];
int b[1024];

_Bool
fn1 ()
{
  _Bool tem;
  for (int i = 0; i < 1024; ++i)
{
  tem = !a[i];
  b[i] = tem;
}
  return tem;
}

We try to extract the last bit from a 128-bit value of a mask vector.  In order
to do so we first subreg by a tieable vector mode (here RVVMF4QI) then, because
we do not have a RVVMF4QI -> BI vector extraction, try type punning with a
TImode subreg.
As we do not natively support TImode, the result needs to be subreg'd again to
DImode.  In the course of doing so we get lost in subreg moves and hit an
infinite loop.  I have not tracked down the real root cause but the problem is
fixed by providing a movti pattern and special-casing subreg:TI extraction from
vectors (just like we do in legitimize_move for other scalar subregs of vectors
- and which I don't particularly like either :) ).

[Bug middle-end/112971] [14] RISC-V rv64gcv_zvl256b vector -O3: internal compiler error: Segmentation fault signal terminated program cc1

2023-12-12 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112971

--- Comment #8 from Robin Dapp  ---
Yes, can confirm that this helps.

[Bug target/112971] [14] RISC-V rv64gcv_zvl256b vector -O3: internal compiler error: Segmentation fault signal terminated program cc1

2023-12-12 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112971

--- Comment #5 from Robin Dapp  ---
Yes that's what I just tried.  No infinite loop anymore then.  But that's not a
new simplification and looks reasonable so there must be something special for
our backend.

[Bug target/112971] [14] RISC-V rv64gcv_zvl256b vector -O3: internal compiler error: Segmentation fault signal terminated program cc1

2023-12-12 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112971

--- Comment #3 from Robin Dapp  ---
In match.pd we do something like this:


;; Function e (e, funcdef_no=0, decl_uid=2751, cgraph_uid=1, symbol_order=4)


Pass statistics of "forwprop": 

Matching expression match.pd:2771, gimple-match-2.cc:35
Matching expression match.pd:2774, gimple-match-1.cc:66
Matching expression match.pd:2781, gimple-match-2.cc:96
Aborting expression simplification due to deep recursion
Aborting expression simplification due to deep recursion
Applying pattern match.pd:6784, gimple-match-5.cc:1742
Applying pattern match.pd:6784, gimple-match-5.cc:1742
Applying pattern match.pd:6784, gimple-match-5.cc:1742
Applying pattern match.pd:6784, gimple-match-5.cc:1742
Applying pattern match.pd:6784, gimple-match-5.cc:1742
Applying pattern match.pd:6784, gimple-match-5.cc:1742
Applying pattern match.pd:6784, gimple-match-5.cc:1742
Applying pattern match.pd:6784, gimple-match-5.cc:1742
Applying pattern match.pd:6784, gimple-match-5.cc:1742
Applying pattern match.pd:6784, gimple-match-5.cc:1742
Applying pattern match.pd:6784, gimple-match-5.cc:1742
gimple_simplified to _53 = { 0, ... } & { 8, 7, 6, ... };
_63 = { 0, ... } & { -9, -8, -7, ... };
_52 = { 0, ... } & { 8, 7, 6, ... };
_74 = { 0, ... } & { -9, -8, -7, ... };
_38 = { 0, ... } & { 8, 7, 6, ... };
_40 = { 0, ... } & { -9, -8, -7, ... };
_55 = { 0, ... } & { 8, 7, 6, ... };
_57 = { 0, ... } & { -9, -8, -7, ... };
_65 = { 0, ... } & { 8, 7, 6, ... };
_72 = { 0, ... } & { -9, -8, -7, ... };
_32 = { 0, ... } & { 8, 7, 6, ... };
mask__6.19_61 = _32 == { 0, ... };

That doesn't look particularly backend related but we're trying to simplify a
mask so you never know...

[Bug target/112971] [14] RISC-V rv64gcv_zvl256b vector -O3: internal compiler error: Segmentation fault signal terminated program cc1

2023-12-12 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112971

--- Comment #2 from Robin Dapp  ---
It doesn't look like the same issue to me.  The other bug is related to TImode
handling in combination with mask registers.  I will also have a look at this
one.

[Bug target/112929] [14] RISC-V vector: Variable clobbered at runtime

2023-12-11 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112929

--- Comment #15 from Robin Dapp  ---
I think we need to make sure that we're not writing out of bounds.  In that
case anything might happen: if we just don't happen to overwrite this variable
we might hit another one, and the test can still pass "by accident".

If my analysis is correct (it was just done very quickly) the vl should be 32
at that point and we should not write past that size.
We could have printf output a larger chunk of memory.  Maybe this way we could
see whether something was clobbered even with the newer qemu.

[Bug target/112853] RISC-V: RVV: SPEC2017 525.x264 regression

2023-12-11 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112853

--- Comment #10 from Robin Dapp  ---
I just realized that I forgot to post the comparison recently.  With the patch
now upstream I don't see any differences for zvl128b and different vlens
anymore.  What I haven't fully tested yet is zvl256b or higher with various
vlens.

[Bug target/112929] [14] RISC-V vector: Variable clobbered at runtime

2023-12-11 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112929

--- Comment #13 from Robin Dapp  ---
I just built from the most recent commit and it still fails for me.
Could there be a difference in qemu?  I'm on qemu-riscv64 version 8.1.91 but
yours is even newer so that might not explain it.

You could step through until the last vsetvl before the printf and check the vl
after it (or the avl in a4).
As we overwrite the stack it might lead to different outcomes on different
environments.

[Bug target/112929] [14] RISC-V vector: Variable clobbered at runtime

2023-12-09 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112929

--- Comment #9 from Robin Dapp  ---
In the good version the length is 32 here because directly before the vsetvl we
have:

li  a4,32

That seems to get lost somehow.

[Bug target/112929] [14] RISC-V vector: Variable clobbered at runtime

2023-12-09 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112929

--- Comment #7 from Robin Dapp  ---
Here

0x105c6   vse8.v  v8,(a5)

is where we overwrite m.  The vl is 128 but the preceding vsetvl gets a4 =
46912504507016 as AVL, which already seems broken.

[Bug target/112929] [14] RISC-V vector: Variable clobbered at runtime

2023-12-09 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112929

--- Comment #6 from Robin Dapp  ---
This seems to be gone when simple vsetvl (instead of lazy) is used or with
-fno-schedule-insns which might indicate a vsetvl pass problem.

We might have a few more of those.  Maybe it would make sense to run the
testsuite with an RVV-enabled valgrind.  But that might give more false
positives than real findings :/

[Bug target/112853] RISC-V: RVV: SPEC2017 525.x264 regression

2023-12-06 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112853

--- Comment #8 from Robin Dapp  ---
With Juzhe's latest fix that disables VLS modes >= 128 bit for zvl128b, x264
runs without issues here and some of the additional execution failures are
gone.

Will post the current comparison later.

[Bug middle-end/112872] [14 Regression] RISCV ICE: in store_integral_bit_field, at expmed.cc:1049 with -03 rv64gcv_zvl1024b --param=riscv-autovec-preference=fixed-vlmax

2023-12-06 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112872

--- Comment #2 from Robin Dapp  ---
Thanks.  Yes that's similar and also looks fixed by the introduction of the
vec_init expander.  Added this test case to the patch and will push it soon.

[Bug target/112853] RISC-V: RVV: SPEC2017 525.x264 regression

2023-12-05 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112853

--- Comment #7 from Robin Dapp  ---
Ah, forgot three tests:

FAIL: gcc.dg/vect/bb-slp-cond-1.c execution test
FAIL: gcc.dg/vect/bb-slp-pr101668.c -flto -ffat-lto-objects execution test
FAIL: gcc.dg/vect/bb-slp-pr101668.c execution test

On vlen=512

gfortran.dg/array_constructor_4.f90
gfortran.dg/vector_subscript_8.f90
gfortran.fortran-torture/execute/in-pack.f90

are gone again, the rest is similar.  Are those the unstable ones?

[Bug target/112853] RISC-V: RVV: SPEC2017 525.x264 regression

2023-12-05 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112853

--- Comment #6 from Robin Dapp  ---
I indeed see more failures with _zvl128b, vlen=256 (than with _zvl128b,
vlen=128):

FAIL: gcc.dg/vect/pr66251.c -flto -ffat-lto-objects execution test
FAIL: gcc.dg/vect/pr66251.c execution test
FAIL: gcc.dg/vect/pr66253.c -flto -ffat-lto-objects execution test
FAIL: gcc.dg/vect/pr66253.c execution test
FAIL: gcc.dg/vect/slp-46.c -flto -ffat-lto-objects execution test
FAIL: gcc.dg/vect/slp-46.c execution test
FAIL: gcc.dg/vect/vect-alias-check-10.c -flto -ffat-lto-objects execution test
FAIL: gcc.dg/vect/vect-alias-check-10.c execution test
FAIL: gcc.dg/vect/vect-alias-check-11.c -flto -ffat-lto-objects execution test
FAIL: gcc.dg/vect/vect-alias-check-11.c execution test
FAIL: gcc.dg/vect/vect-alias-check-12.c -flto -ffat-lto-objects execution test
FAIL: gcc.dg/vect/vect-alias-check-12.c execution test
FAIL: gcc.dg/vect/vect-alias-check-18.c -flto -ffat-lto-objects execution test
FAIL: gcc.dg/vect/vect-alias-check-18.c execution test

FAIL: gfortran.dg/array_constructor_4.f90   -O1  execution test
FAIL: gfortran.dg/associate_18.f08   -O1  execution test
FAIL: gfortran.dg/vector_subscript_8.f90   -O1  execution test
FAIL: gfortran.dg/vector_subscript_8.f90   -O2  execution test
FAIL: gfortran.dg/vector_subscript_8.f90   -O3 -fomit-frame-pointer
-funroll-loops -fpeel-loops -ftracer -finline-functions  execution test
FAIL: gfortran.dg/vector_subscript_8.f90   -O3 -g  execution test

FAIL: gfortran.fortran-torture/execute/in-pack.f90 execution,  -O1
FAIL: gfortran.fortran-torture/execute/in-pack.f90 execution,  -O2
FAIL: gfortran.fortran-torture/execute/in-pack.f90 execution,  -O2
-fbounds-check
FAIL: gfortran.fortran-torture/execute/in-pack.f90 execution,  -O2
-fomit-frame-pointer -finline-functions
FAIL: gfortran.fortran-torture/execute/in-pack.f90 execution,  -O2
-fomit-frame-pointer -finline-functions -funroll-loops
FAIL: gfortran.fortran-torture/execute/in-pack.f90 execution,  -O3 -g

Maybe those can give a hint.
