Re: [PATCH v2 2/3] RISC-V: Add Zalrsc and Zaamo testsuite support

2024-06-12 Thread Jeff Law




On 6/11/24 12:21 PM, Patrick O'Neill wrote:



I made the whitespace cleanup patch (trailing whitespace, leading groups 
of 8 spaces -> tabs) for

target-supports.exp and got a diff of 584 lines.

Is this still worth doing or will it be too disruptive for rebasing/ 
other people's development?
I don't think it's overly disruptive.  This stuff doesn't have a lot of 
churn.  It'd be different if you were reformatting the whole tree :-)


Consider those fixes pre-approved.

jeff



Re: [PATCH 0/3] RISC-V: Amo testsuite cleanup

2024-06-12 Thread Jeff Law




On 6/11/24 12:03 PM, Patrick O'Neill wrote:

This series moves the atomic-related riscv testcases into their own folder and
fixes some minor bugs/rigidity of existing testcases.

This series is OK.
jeff



Re: [PATCH v2] Test: Move target independent test cases to gcc.dg/torture

2024-06-12 Thread Jeff Law




On 6/11/24 8:53 AM, pan2...@intel.com wrote:

From: Pan Li 

The test cases of pr115387 are target independent; at least x86
and riscv are able to reproduce them.  Thus, move these cases to
gcc.dg/torture.

The below test suites are passed.
1. The rv64gcv fully regression test.
2. The x86 fully regression test.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/pr115387-1.c: Move to...
* gcc.dg/torture/pr115387-1.c: ...here.
* gcc.target/riscv/pr115387-2.c: Move to...
* gcc.dg/torture/pr115387-2.c: ...here.

OK
jeff



Re: [PATCH v2] Target-independent store forwarding avoidance.

2024-06-12 Thread Jeff Law




On 6/12/24 12:47 AM, Richard Biener wrote:



One of the points I wanted to make is that sched1 can make quite a
difference as to the relative distance of the store and load and
we have the instruction window the pass considers when scanning
(possibly driven by target uarch details).  So doing the rewriting
before sched1 might be not ideal (but I don't know how much cleanup
work the pass leaves behind - there's nothing between sched1 and RA).
ACK.  I guess I'm just skeptical about how much separation we can get in 
practice from scheduling.


As far as cleanup opportunity, it likely comes down to how clean the 
initial codegen is for the bitfield insertion step.






On the hardware side I always wondered whether a failed load-to-store
forward results in the load uop stalling (because the hardware actually
_did_ see the conflict with an in-flight store) or whether this gets
caught later as the hardware speculates a load from L1 (with the
wrong value) but has to roll back because of the conflict.  I would
imagine the latter is cheaper to implement but worse in case of
conflict.
I wouldn't be surprised to see both approaches being used and I suspect 
it really depends on how speculative your uarch is.  At some point 
there's enough speculation going on that you can't detect the violation 
early enough and you have to implement a replay/rollback scheme.


jeff


Re: [PATCH v2] Target-independent store forwarding avoidance.

2024-06-11 Thread Jeff Law




On 6/11/24 7:52 AM, Philipp Tomsich wrote:

On Tue, 11 Jun 2024 at 15:37, Jeff Law  wrote:




On 6/11/24 1:22 AM, Richard Biener wrote:


Absolutely.   But forwarding from a smaller store to a wider load is painful
from a hardware standpoint and if we can avoid it from a codegen standpoint,
we should.


Note there's also the possibility to increase the distance between the
store and the load - in fact the time a store takes to a) retire and
b) get from the store buffers to where the load-store unit would pick it
up (L1-D) is another target specific tuning knob.  That said, if that
distance isn't too large (on x86 there might be only an upper bound
given by the OOO window size and the L1D store latency(?), possibly
also additionally by the store buffer size) attacking the issue in
sched1 or sched2 might be another possibility.  So I think pass placement
is another thing to look at - I'd definitely place it after sched1
but I guess without looking at the pass again it's way before that?

True, but I doubt there are enough instructions we could sink the load
past to make a measurable difference.  This is especially true on the
class of uarchs where this is going to be most important.

In the case where the store/load can't be interchanged and thus this new
pass rejects any transformation, we could try to do something in the
scheduler to defer the load as long as possible.  Essentially it's a
true dependency through a memory location using must-aliasing properties
and in that case we'd want to crank up the "latency" of the store so
that the load gets pushed away.

I think one of the difficulties here is we often model stores as not
having any latency (which is probably OK in most cases).  Input data
dependencies and structural hazards dominate considerations for 
stores.


I don't think that TARGET_SCHED_ADJUST_COST would even be called for a
data-dependence through a memory location.
Probably correct, but we could adjust that behavior or add another 
mechanism to adjust costs based on memory dependencies.




Note that, strictly speaking, the store does not have an extended
latency; it will be the load that will have an increased latency
(almost as if we knew that the load will miss to one of the outer
points-of-coherence).  The difference being that the load would not
hang around in a scheduling queue until being dispatched, but its
execution would start immediately and take more cycles (and
potentially block an execution pipeline for longer).
Absolutely true.  I'm being imprecise in my language, increasing the 
"latency" of the store is really a proxy for "do something to encourage 
the load to move away from the store".


But overall rewriting the sequence is probably the better choice.  In my 
mind the scheduler approach would be a secondary attempt if we couldn't 
interchange the store/load.  And I'd make a small bet that its impact 
would be on the margins if we're doing a reasonable job in the new pass.


Jeff



Re: [PATCH v1] Test: Move target independent test cases to gcc.dg/torture

2024-06-11 Thread Jeff Law




On 6/11/24 12:19 AM, pan2...@intel.com wrote:

From: Pan Li 

The test cases of pr115387 are target independent; at least x86
and riscv are able to reproduce them.  Thus, move these cases to
gcc.dg/torture.

The below test suites are passed.
1. The rv64gcv fully regression test.
2. The x86 fully regression test.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/pr115387-1.c: Move to...
* gcc.dg/torture/pr115387-1.c: ...here.
* gcc.target/riscv/pr115387-2.c: Move to...
* gcc.dg/torture/pr115387-2.c: ...here.

OK
jeff



Re: [PATCH v2] Target-independent store forwarding avoidance.

2024-06-11 Thread Jeff Law




On 6/11/24 1:22 AM, Richard Biener wrote:


Absolutely.   But forwarding from a smaller store to a wider load is painful
from a hardware standpoint and if we can avoid it from a codegen standpoint,
we should.


Note there's also the possibility to increase the distance between the
store and the load - in fact the time a store takes to a) retire and
b) get from the store buffers to where the load-store unit would pick it
up (L1-D) is another target specific tuning knob.  That said, if that
distance isn't too large (on x86 there might be only an upper bound
given by the OOO window size and the L1D store latency(?), possibly
also additionally by the store buffer size) attacking the issue in
sched1 or sched2 might be another possibility.  So I think pass placement
is another thing to look at - I'd definitely place it after sched1
but I guess without looking at the pass again it's way before that?
True, but I doubt there are enough instructions we could sink the load 
past to make a measurable difference.  This is especially true on the 
class of uarchs where this is going to be most important.


In the case where the store/load can't be interchanged and thus this new 
pass rejects any transformation, we could try to do something in the 
scheduler to defer the load as long as possible.  Essentially it's a 
true dependency through a memory location using must-aliasing properties 
and in that case we'd want to crank up the "latency" of the store so 
that the load gets pushed away.


I think one of the difficulties here is we often model stores as not 
having any latency (which is probably OK in most cases).  Input data 
dependencies and structural hazards dominate considerations for 
stores.


jeff




[committed] [RISC-V] Drop dead test

2024-06-10 Thread Jeff Law
This test is no longer useful.  It doesn't test what it was originally 
intended to test and there's really no way to recover it sanely.


We agreed in the patchwork meeting last week that if we want to test Zfa 
that we'll write a new test for that.  Similarly if we want to do deeper 
testing of the non-Zfa sequences in this space that we'd write new tests 
for those as well (execution tests in particular).


So dropping this test.

Jeff

commit 95161c6abfbd7ba9fab0b538ccc885f5980efbee
Author: Jeff Law 
Date:   Mon Jun 10 22:39:40 2024 -0600

[committed] [RISC-V] Drop dead round_32 test

This test is no longer useful.  It doesn't test what it was originally 
intended
to test and there's really no way to recover it sanely.

We agreed in the patchwork meeting last week that if we want to test Zfa 
that
we'll write a new test for that.  Similarly if we want to do deeper testing 
of
the non-Zfa sequences in this space that we'd write new tests for those as 
well
(execution tests in particular).

So dropping this test.

gcc/testsuite
* gcc.target/riscv/round_32.c: Delete.

diff --git a/gcc/testsuite/gcc.target/riscv/round_32.c 
b/gcc/testsuite/gcc.target/riscv/round_32.c
deleted file mode 100644
index 88ff77aff2e..000
--- a/gcc/testsuite/gcc.target/riscv/round_32.c
+++ /dev/null
@@ -1,23 +0,0 @@
-/* { dg-do compile { target { riscv32*-*-* } } } */
-/* { dg-require-effective-target glibc } */
-/* { dg-options "-march=rv32gc -mabi=ilp32d -fno-math-errno 
-funsafe-math-optimizations -fno-inline" } */
-/* { dg-skip-if "" { *-*-* } { "-O0" "-Og" } } */
-
-#include "round.c"
-
-/* { dg-final { scan-assembler-times {\mfcvt.w.s} 15 } } */
-/* { dg-final { scan-assembler-times {\mfcvt.s.w} 5 } } */
-/* { dg-final { scan-assembler-times {\mfcvt.d.w} 65 } } */
-/* { dg-final { scan-assembler-times {\mfcvt.w.d} 15 } } */
-/* { dg-final { scan-assembler-times {,rup} 6 } } */
-/* { dg-final { scan-assembler-times {,rmm} 6 } } */
-/* { dg-final { scan-assembler-times {,rdn} 6 } } */
-/* { dg-final { scan-assembler-times {,rtz} 6 } } */
-/* { dg-final { scan-assembler-not {\mfcvt.l.d} } } */
-/* { dg-final { scan-assembler-not {\mfcvt.d.l} } } */
-/* { dg-final { scan-assembler-not "\\sceil\\s" } } */
-/* { dg-final { scan-assembler-not "\\sfloor\\s" } } */
-/* { dg-final { scan-assembler-not "\\sround\\s" } } */
-/* { dg-final { scan-assembler-not "\\snearbyint\\s" } } */
-/* { dg-final { scan-assembler-not "\\srint\\s" } } */
-/* { dg-final { scan-assembler-not "\\stail\\s" } } */


Re: [PATCH v3 0/3] RISC-V: Add basic Zaamo and Zalrsc support

2024-06-10 Thread Jeff Law




On 6/10/24 3:46 PM, Patrick O'Neill wrote:

The A extension has been split into two parts: Zaamo and Zalrsc.
This patch adds basic support by making the A extension imply Zaamo and
Zalrsc.

Zaamo/Zalrsc spec: https://github.com/riscv/riscv-zaamo-zalrsc/tags
Ratification: https://jira.riscv.org/browse/RVS-1995

v2:
Rebased and updated some testcases that rely on the ISA string.

v3:
Regex-ify temp registers in added testcases.
Remove unintentional whitespace changes.
Add riscv_{a|ztso|zaamo|zalrsc} docs to sourcebuild.texi (and move core-v bi
extension doc into appropriate section).

Edwin Lu (1):
   RISC-V: Add basic Zaamo and Zalrsc support

Patrick O'Neill (2):
   RISC-V: Add Zalrsc and Zaamo testsuite support
   RISC-V: Add Zalrsc amo-op patterns

  gcc/common/config/riscv/riscv-common.cc   |  11 +-
  gcc/config/riscv/arch-canonicalize|   1 +
  gcc/config/riscv/riscv.opt|   6 +-
  gcc/config/riscv/sync.md  | 152 +++---
  gcc/doc/sourcebuild.texi  |  16 +-
  .../riscv/amo-table-a-6-amo-add-1.c   |   2 +-
  .../riscv/amo-table-a-6-amo-add-2.c   |   2 +-
  .../riscv/amo-table-a-6-amo-add-3.c   |   2 +-
  .../riscv/amo-table-a-6-amo-add-4.c   |   2 +-
  .../riscv/amo-table-a-6-amo-add-5.c   |   2 +-
  .../riscv/amo-table-a-6-compare-exchange-1.c  |   2 +-
  .../riscv/amo-table-a-6-compare-exchange-2.c  |   2 +-
  .../riscv/amo-table-a-6-compare-exchange-3.c  |   2 +-
  .../riscv/amo-table-a-6-compare-exchange-4.c  |   2 +-
  .../riscv/amo-table-a-6-compare-exchange-5.c  |   2 +-
  .../riscv/amo-table-a-6-compare-exchange-6.c  |   2 +-
  .../riscv/amo-table-a-6-compare-exchange-7.c  |   2 +-
  .../riscv/amo-table-a-6-subword-amo-add-1.c   |   2 +-
  .../riscv/amo-table-a-6-subword-amo-add-2.c   |   2 +-
  .../riscv/amo-table-a-6-subword-amo-add-3.c   |   2 +-
  .../riscv/amo-table-a-6-subword-amo-add-4.c   |   2 +-
  .../riscv/amo-table-a-6-subword-amo-add-5.c   |   2 +-
  .../riscv/amo-table-ztso-amo-add-1.c  |   2 +-
  .../riscv/amo-table-ztso-amo-add-2.c  |   2 +-
  .../riscv/amo-table-ztso-amo-add-3.c  |   2 +-
  .../riscv/amo-table-ztso-amo-add-4.c  |   2 +-
  .../riscv/amo-table-ztso-amo-add-5.c  |   2 +-
  .../riscv/amo-table-ztso-compare-exchange-1.c |   2 +-
  .../riscv/amo-table-ztso-compare-exchange-2.c |   2 +-
  .../riscv/amo-table-ztso-compare-exchange-3.c |   2 +-
  .../riscv/amo-table-ztso-compare-exchange-4.c |   2 +-
  .../riscv/amo-table-ztso-compare-exchange-5.c |   2 +-
  .../riscv/amo-table-ztso-compare-exchange-6.c |   2 +-
  .../riscv/amo-table-ztso-compare-exchange-7.c |   2 +-
  .../riscv/amo-table-ztso-subword-amo-add-1.c  |   2 +-
  .../riscv/amo-table-ztso-subword-amo-add-2.c  |   2 +-
  .../riscv/amo-table-ztso-subword-amo-add-3.c  |   2 +-
  .../riscv/amo-table-ztso-subword-amo-add-4.c  |   2 +-
  .../riscv/amo-table-ztso-subword-amo-add-5.c  |   2 +-
  .../riscv/amo-zaamo-preferred-over-zalrsc.c   |  17 ++
  .../gcc.target/riscv/amo-zalrsc-amo-add-1.c   |  19 +++
  .../gcc.target/riscv/amo-zalrsc-amo-add-2.c   |  19 +++
  .../gcc.target/riscv/amo-zalrsc-amo-add-3.c   |  19 +++
  .../gcc.target/riscv/amo-zalrsc-amo-add-4.c   |  19 +++
  .../gcc.target/riscv/amo-zalrsc-amo-add-5.c   |  19 +++
  gcc/testsuite/gcc.target/riscv/attribute-15.c |   2 +-
  gcc/testsuite/gcc.target/riscv/attribute-16.c |   2 +-
  gcc/testsuite/gcc.target/riscv/attribute-17.c |   2 +-
  gcc/testsuite/gcc.target/riscv/attribute-18.c |   2 +-
  gcc/testsuite/gcc.target/riscv/pr110696.c |   2 +-
  .../gcc.target/riscv/rvv/base/pr114352-1.c|   4 +-
  .../gcc.target/riscv/rvv/base/pr114352-3.c|   8 +-
  gcc/testsuite/lib/target-supports.exp |  48 +-
  53 files changed, 366 insertions(+), 70 deletions(-)
  create mode 100644 
gcc/testsuite/gcc.target/riscv/amo-zaamo-preferred-over-zalrsc.c
  create mode 100644 gcc/testsuite/gcc.target/riscv/amo-zalrsc-amo-add-1.c
  create mode 100644 gcc/testsuite/gcc.target/riscv/amo-zalrsc-amo-add-2.c
  create mode 100644 gcc/testsuite/gcc.target/riscv/amo-zalrsc-amo-add-3.c
  create mode 100644 gcc/testsuite/gcc.target/riscv/amo-zalrsc-amo-add-4.c
  create mode 100644 gcc/testsuite/gcc.target/riscv/amo-zalrsc-amo-add-5.c

This series is OK for the trunk.

jeff



Re: [PATCH v3 0/3] RISC-V: Add basic Zaamo and Zalrsc support

2024-06-10 Thread Jeff Law




On 6/10/24 6:15 PM, Andrea Parri wrote:

On Mon, Jun 10, 2024 at 02:46:54PM -0700, Patrick O'Neill wrote:

The A extension has been split into two parts: Zaamo and Zalrsc.
This patch adds basic support by making the A extension imply Zaamo and
Zalrsc.

Zaamo/Zalrsc spec: https://github.com/riscv/riscv-zaamo-zalrsc/tags
Ratification: https://jira.riscv.org/browse/RVS-1995

v2:
Rebased and updated some testcases that rely on the ISA string.

v3:
Regex-ify temp registers in added testcases.
Remove unintentional whitespace changes.
Add riscv_{a|ztso|zaamo|zalrsc} docs to sourcebuild.texi (and move core-v bi
extension doc into appropriate section).

Edwin Lu (1):
   RISC-V: Add basic Zaamo and Zalrsc support

Patrick O'Neill (2):
   RISC-V: Add Zalrsc and Zaamo testsuite support
   RISC-V: Add Zalrsc amo-op patterns


While providing a proper/detailed review of the series goes above my
"GCC internals" skills, I've applied the series and checked that the
generated code for some atomic operations meets expectations (expectations
which, w/ "only Zaamo", are arguably quite low as mentioned in
v2 and elsewhere):
Thanks for taking the time.  We realize you're not a GCC expert, but 
having an extra pair of eyes on the atomics is always appreciated.




Tested-by: Andrea Parri 

   Andrea


P.S. Unrelated to the changes at stake, but perhaps worth mentioning:
w/ and w/o these changes, the following

[ ... ]
I'll leave this to Patrick to decide if he wants to update.  I'm always 
hesitant to weaken this stuff as I'm sure there's somebody, somewhere 
that assumes the stronger primitives.


Jeff



Re: [PATCH v1] Widening-Mul: Fix one ICE of gcall insertion for PHI match

2024-06-10 Thread Jeff Law




On 6/10/24 7:28 PM, Li, Pan2 wrote:

Hi Sam,


This testcases ICEs for me on x86-64 too (without your patch) with just -O2.
Can you move it out of the riscv suite? (I suspect the other fails on x86-64 
too).


Sure thing, but do you have any suggestion about where I should put these 2
cases?  There are all sorts of sub-directories under gcc/testsuite, and I am
not sure which would be the most reasonable location.

gcc.dg/torture would be the most natural location I think.

jeff



[to-be-committed] [RISC-V] Improve (1 << N) | C for rv64

2024-06-10 Thread Jeff Law

Another improvement for generating Zbs instructions.

In this case we're looking at stuff like (1 << N) | C where N varies and 
C is a single bit constant.


In this pattern the (1 << N) happens in SImode, but is zero extended out 
to DImode before the bit manipulation.  The fact that we're modifying a 
DImode object in the logical op is important as it means we don't have 
to worry about whether or not the resulting value is sign extended from 
SI to DI.


This has run through Ventana's CI system.  I'll wait for it to roll 
through pre-commit CI before moving forward.


Jeff



gcc/
* config/riscv/bitmanip.md ((1 << N) | C): New splitter for IOR/XOR of
a single bit and a DImode object.

gcc/testsuite/

* gcc.target/riscv/zbs-zext.c: New test.

diff --git a/gcc/config/riscv/bitmanip.md b/gcc/config/riscv/bitmanip.md
index 6c2736454aa..3cc244898e7 100644
--- a/gcc/config/riscv/bitmanip.md
+++ b/gcc/config/riscv/bitmanip.md
@@ -727,6 +727,21 @@ (define_insn "*bsetidisi"
   "bseti\t%0,%1,%S2"
   [(set_attr "type" "bitmanip")])
 
+;; We can easily handle zero extensions
+(define_split
+  [(set (match_operand:DI 0 "register_operand")
+(any_or:DI (zero_extend:DI
+(ashift:SI (const_int 1)
+   (match_operand:QI 1 "register_operand")))
+  (match_operand:DI 2 "single_bit_mask_operand")))
+   (clobber (match_operand:DI 3 "register_operand"))]
+  "TARGET_64BIT && TARGET_ZBS"
+  [(set (match_dup 3)
+(match_dup 2))
+   (set (match_dup 0)
+ (any_or:DI (ashift:DI (const_int 1) (match_dup 1))
+   (match_dup 3)))])
+
 (define_insn "*bclr"
   [(set (match_operand:X 0 "register_operand" "=r")
(and:X (rotate:X (const_int -2)
diff --git a/gcc/testsuite/gcc.target/riscv/zbs-zext.c 
b/gcc/testsuite/gcc.target/riscv/zbs-zext.c
new file mode 100644
index 000..5773b15d298
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/zbs-zext.c
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gc_zbs -mabi=lp64" } */
+/* { dg-skip-if "" { *-*-* } { "-O0" "-Og" "-O1" } } */
+typedef unsigned long uint64_t;
+typedef unsigned int uint32_t;
+
+uint64_t bset (const uint32_t i)
+{
+  uint64_t checks = 8;
+  checks |= 1U << i;
+  return checks;
+}
+
+uint64_t binv (const uint32_t i)
+{
+  uint64_t checks = 8;
+  checks ^= 1U << i;
+  return checks;
+}
+
+uint64_t bclr (const uint32_t i)
+{
+  uint64_t checks = 10;
+  checks &= ~(1U << i);
+  return checks;
+}
+
+/* { dg-final { scan-assembler-times "bset\t" 1 } } */
+/* { dg-final { scan-assembler-times "binv\t" 1 } } */
+/* { dg-final { scan-assembler-times "bclr\t" 1 } } */
+/* { dg-final { scan-assembler-not "sllw\t"} } */


Re: [PATCH v1] Widening-Mul: Fix one ICE of gcall insertion for PHI match

2024-06-10 Thread Jeff Law




On 6/10/24 8:49 AM, pan2...@intel.com wrote:

When the PHI handling for COND_EXPR is enabled,  we need to insert the gcall
to replace the PHI node.  Unfortunately,  I made the mistake of inserting
the gcall before the last stmt of the bb.  See the gimple below: the PHI
is located at no.1 but we insert the gcall (aka no.9) at the end of
the bb.  Then the use of _9 in no.2 will have no def and will trigger
an ICE in verify_ssa.

   1. # _9 = PHI <_3(4), 18446744073709551615(3)> // The PHI node to be deleted.
   2. prephitmp_36 = (char *) _9;
   3. buf.write_base = string_13(D);
   4. buf.write_ptr = string_13(D);
   5. buf.write_end = prephitmp_36;
   6. buf.written = 0;
   7. buf.mode = 3;
   8. _7 = buf.write_end;
   9. _9 = .SAT_ADD (string.0_2, maxlen_15(D));   // Insert gcall to last bb by 
mistake

This patch instead inserts the gcall before the first stmt of the bb,
to ensure that any possible use of the PHI result has an existing def.
After this patch the above gimple will be:

   0. _9 = .SAT_ADD (string.0_2, maxlen_15(D));   // Insert gcall at start of bb
   1. # _9 = PHI <_3(4), 18446744073709551615(3)> // The PHI node to be deleted.
   2. prephitmp_36 = (char *) _9;
   3. buf.write_base = string_13(D);
   4. buf.write_ptr = string_13(D);
   5. buf.write_end = prephitmp_36;
   6. buf.written = 0;
   7. buf.mode = 3;
   8. _7 = buf.write_end;

The below test suites are passed for this patch:
* The rv64gcv fully regression test with newlib.
* The rv64gcv build with glibc.
* The x86 regression test with newlib.
* The x86 bootstrap test with newlib.

PR target/115387

gcc/ChangeLog:

* tree-ssa-math-opts.cc (math_opts_dom_walker::after_dom_children): Take
the gsi of start_bb instead of last_bb.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/pr115387-1.c: New test.
* gcc.target/riscv/pr115387-2.c: New test.

I did a fresh x86_64 bootstrap and regression test and pushed this.

jeff



Re: [PATCH] Move array_bounds warnings into it's own pass.

2024-06-10 Thread Jeff Law




On 6/10/24 1:24 PM, Andrew MacLeod wrote:
The array bounds warning pass was originally attached to the VRP pass 
because it wanted to leverage the context sensitive ranges available there.


With ranger, we can make it a pass of its own for very little cost. This 
patch does that. It removes the array_bounds_checker from VRP and makes 
it a solo pass that runs immediately after VRP1.


The original version had VRP add any un-executable edge flags it found, 
but I could not find a case where after VRP cleans up the CFG the new 
pass needed that.  I also did not find a case where activating SCEV 
again for the warning pass made a difference after VRP had run.  So this 
patch does neither of those things.


It simple enough to later add SCEV and loop analysis again if it turns 
out to be important.


My primary motivation for removing it was to remove the second DOM walk 
the checker performs which depends on on-demand ranges pre-cached by 
ranger.   This prevented VRP from choosing an alternative VRP solution 
when basic block counts are very high (PR 114855).  I also know Siddesh 
wants to experiment with moving the pass later in the pipeline as well, 
which will make that task much simpler as a secondary rationale.


I didn't want to mess with the internal code much. For a multitude of 
reasons.  I did change it so that it always uses the current range_query 
object instead of passing one in to the constructor.  And then I cleaned 
up the VRP code to no longer take a flag on whether to invoke the 
warning code or not.


The final bit is that the pass is set to only run when flag_tree_vrp is on. 
I did this primarily to preserve existing functionality, as some tests 
depend on it: e.g., turning on -Warray-bounds while disabling the tree-vrp 
pass (which means the bounds checker doesn't run) changes the expected 
warnings from the strlen pass.  I'm not going there.  There are also tests 
which run at -O1 and -Wall that do not expect the bounds checker to run 
either.  So this dependence on the vrp flag is documented in the code and 
preserves existing behavior.


Does anyone have any issues with any of this?
No, in fact, quite the opposite.  I think we very much want the warning 
out of VRP into its own little pass that we can put wherever it makes 
sense in the pipeline rather than having it be tied to VRP.


I'd probably look at the -O1 vs -Wall stuff independently so that we 
could (in theory) eventually remove the dependence on flag_vrp.


jeff




Re: [PATCH v2] Target-independent store forwarding avoidance.

2024-06-10 Thread Jeff Law




On 6/10/24 12:27 PM, Philipp Tomsich wrote:



This change is what I briefly hinted as "the complete solution" that
we had on the drawing board when we briefly talked last November in
Santa Clara.
I haven't any recollection of that part of the discussion, but I was a 
bit frazzled as you probably noticed.




  We have looked at all of SPEC2017, especially for coverage (i.e.,
making sure we see a significant number of uses of the transformation)
and correctness.  The gcc_r and parest_r components triggered in a
number of "interesting" ways (e.g., motivating the case of
load-elimination).  If it helps, we could share the statistics for how
often the pass triggers on compiling each of the SPEC2017 components?
Definitely helpful.  I may be able to juggle some priorities internally 
to lend a larger hand on testing and helping move this forward.  It's an 
area we're definitely interested in.


Jeff


Re: [PATCH v2] Target-independent store forwarding avoidance.

2024-06-10 Thread Jeff Law




On 6/10/24 1:55 AM, Manolis Tsamis wrote:




There was an older submission of a load-pair specific pass but this is
a complete reimplementation and indeed significantly more general.
Apart from being target independent, it addresses a number of
important restrictions and can handle multiple store forwardings per
load.
It should be noted that it cannot handle the load-pair cases as these
need special handling, but that's something we're planning to do in
the future by reusing this infrastructure.

ACK.  Thanks for the additional background.







diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 4e8967fd8ab..c769744d178 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -12657,6 +12657,15 @@ loop unrolling.
   This option is enabled by default at optimization levels @option{-O1},
   @option{-O2}, @option{-O3}, @option{-Os}.

+@opindex favoid-store-forwarding
+@item -favoid-store-forwarding
+@itemx -fno-avoid-store-forwarding
+Many CPUs will stall for many cycles when a load partially depends on previous
+smaller stores.  This pass tries to detect such cases and avoid the penalty by
+changing the order of the load and store and then fixing up the loaded value.
+
+Disabled by default.

Is there any particular reason why this would be off by default at -O1
or higher?  It would seem to me that on modern cores that this
transformation should easily be a win.  Even on an old in-order core,
avoiding the load with the bit insert is likely profitable, just not as
much so.


I don't have a strong opinion for that but I believe Richard's
suggestion to decide this on a per-target basis also makes a lot of
sense.
Deciding whether the transformation is profitable is tightly tied to
the architecture in question (i.e. how large the stall is and what
sort of bit-insert instructions are available).
In order to make this more widely applicable, I think we'll need a
target hook that decides in which case the forwarded stores incur a
penalty and thus the transformation makes sense.
You and Richi are probably right.   I'm not a big fan of passes being 
enabled/disabled on particular targets, but it may make sense here.





Afaik, for each CPU there may be cases where store forwarding is
handled efficiently.
Absolutely.   But forwarding from a smaller store to a wider load is 
painful from a hardware standpoint and if we can avoid it from a codegen 
standpoint, we should.


Did y'all look at spec2017 at all for this patch?  I've got our hardware 
guys to expose a signal for this case so that we can (in a month or so) 
get some hard data on how often it's happening in spec2017 and evaluate 
how this patch helps the most affected workloads.  But if y'all already 
have some data we can use it as a starting point.


jeff


Re: [PATCH 1/5] RISC-V: Remove float vector eqne pattern

2024-06-10 Thread Jeff Law




On 6/10/24 8:52 AM, Li, Pan2 wrote:

Not sure if the float eq implementation below in sail-riscv is useful or not,
but it looks like there is some special handling for NaN as well as sNaN.

https://github.com/riscv/sail-riscv/blob/master/c_emulator/SoftFloat-3e/source/f32_eq.c

Yes, but it's symmetrical, which is what we'd want to see.

jeff



Re: [PATCH 1/5] RISC-V: Remove float vector eqne pattern

2024-06-10 Thread Jeff Law




On 6/10/24 10:16 AM, Demin Han wrote:

Hi,

I'm on vacation at the moment.
I will return in a few days and submit a new patch with the test.

No problem.  Enjoy your vacation, this can certainly wait until you return.

jeff



Re: [PATCH v1] Widening-Mul: Fix one ICE of gcall insertion for PHI match

2024-06-10 Thread Jeff Law




On 6/10/24 8:49 AM, pan2...@intel.com wrote:

From: Pan Li 

When the PHI handling for COND_EXPR is enabled,  we need to insert the gcall
to replace the PHI node.  Unfortunately,  I made the mistake of inserting
the gcall before the last stmt of the bb.  See the gimple below: the PHI
is located at no.1 but we insert the gcall (aka no.9) at the end of
the bb.  Then the use of _9 in no.2 will have no def and will trigger
an ICE in verify_ssa.

   1. # _9 = PHI <_3(4), 18446744073709551615(3)> // The PHI node to be deleted.
   2. prephitmp_36 = (char *) _9;
   3. buf.write_base = string_13(D);
   4. buf.write_ptr = string_13(D);
   5. buf.write_end = prephitmp_36;
   6. buf.written = 0;
   7. buf.mode = 3;
   8. _7 = buf.write_end;
   9. _9 = .SAT_ADD (string.0_2, maxlen_15(D));   // Insert gcall to last bb by 
mistake

This patch instead inserts the gcall before the first stmt of the bb,
to ensure that any possible use of the PHI result has an existing def.
After this patch the above gimple will be:

   0. _9 = .SAT_ADD (string.0_2, maxlen_15(D));   // Insert gcall at start of bb
   1. # _9 = PHI <_3(4), 18446744073709551615(3)> // The PHI node to be deleted.
   2. prephitmp_36 = (char *) _9;
   3. buf.write_base = string_13(D);
   4. buf.write_ptr = string_13(D);
   5. buf.write_end = prephitmp_36;
   6. buf.written = 0;
   7. buf.mode = 3;
   8. _7 = buf.write_end;

The below test suites are passed for this patch:
* The rv64gcv fully regression test with newlib.
* The rv64gcv build with glibc.
* The x86 regression test with newlib.
* The x86 bootstrap test with newlib.
So the patch looks fine.  I'm just trying to parse the testing.  If you 
did an x86 bootstrap & regression test, you wouldn't be using newlib. 
That would be a native bootstrap & regression test which would use 
whatever C library is already installed on the system.  I'm assuming 
that's what you did.


If my assumption is correct, then this is fine for the trunk.

jeff



Re: [PATCH 1/5] RISC-V: Remove float vector eqne pattern

2024-06-10 Thread Jeff Law




On 6/10/24 1:33 AM, Robin Dapp wrote:

But isn't canonicalization of EQ/NE safe, even for IEEE NaN and +-0.0?

target = (a == b) ? x : y
target = (a != b) ? y : x

Are equivalent, even for IEEE IIRC.


Yes, that should be fine.  My concern was not that we do a
canonicalization but that we might not do it for some of the
vector cases.  In particular when one of the operands is wrapped
in a vec_duplicate and we end up with it first rather than
second.

My general feeling is that the patch is good but I wasn't entirely
sure about all cases (in particular in case we transform something
after expand).  That's why I would have liked to see at least some
small test cases for it along with the patch (for the combinations
we don't test yet).

Ah, OK.

Demin, can you add some additional test coverage, guided by Robin's concerns 
above?


Thanks,
jeff



[to-be-committed][RISC-V] Generate bclr more often for rv64

2024-06-10 Thread Jeff Law
Another of Raphael's patches to improve our ability to safely generate a 
Zbs instruction, bclr in this instance.


In this case we have something like ~(1 << N) & C where N is variable, 
but C is a constant.  If C has 33 or more leading zeros, then no matter 
what bit we clear via bclr, the result will always have at least bits 
31..63 clear.  So we don't have to worry about any of the extension 
issues with SI objects in rv64.
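
The reasoning can be sanity-checked in plain C (a hedged sketch; clear_bit is a hypothetical name, not from the patch): with a constant whose bits 31..63 are already zero, clearing any single bit still leaves those upper bits zero.

```c
#include <assert.h>
#include <stdint.h>

/* C has at least 33 leading zeros, so bits 31..63 are zero.  ANDing
   with ~(1 << n) can only clear bits, never set them, so bits 31..63
   of the result stay zero -- a "virtual zero extension".  */
uint64_t
clear_bit (unsigned n)
{
  uint64_t c = 10;                  /* 33+ leading zeros */
  return c & ~(1ULL << n);          /* candidate for a single bclr */
}
```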


Odds are this was seen in spec at some point by the RAU team, thus 
leading to Raphael's pattern.


Anyway, this has been through Ventana's CI system in the past.  I'll 
wait for it to work through upstream pre-commit CI before taking further 
action, but the plan is to commit after successful CI run.


Jeff



gcc/

* config/riscv/bitmanip.md ((~1 << N) & C): New splitter.

gcc/testsuite/

* gcc.target/riscv/zbs-ext.c: New test.


diff --git a/gcc/config/riscv/bitmanip.md b/gcc/config/riscv/bitmanip.md
index 6559d4d6950..4361be1c265 100644
--- a/gcc/config/riscv/bitmanip.md
+++ b/gcc/config/riscv/bitmanip.md
@@ -784,6 +784,23 @@ (define_insn_and_split "*bclridisi_nottwobits"
 }
 [(set_attr "type" "bitmanip")])
 
+;; An outer AND with a constant where bits 31..63 are 0 can be seen as
+;; a virtual zero extension from 31 to 64 bits.
+(define_split
+  [(set (match_operand:DI 0 "register_operand")
+(and:DI (not:DI (subreg:DI
+ (ashift:SI (const_int 1)
+(match_operand:QI 1 "register_operand")) 0))
+(match_operand:DI 2 "arith_operand")))
+   (clobber (match_operand:DI 3 "register_operand"))]
+  "TARGET_64BIT && TARGET_ZBS
+   && clz_hwi (INTVAL (operands[2])) >= 33"
+  [(set (match_dup 3)
+(match_dup 2))
+   (set (match_dup 0)
+ (and:DI (rotate:DI (const_int -2) (match_dup 1))
+ (match_dup 3)))])
+
 (define_insn "*binv"
   [(set (match_operand:X 0 "register_operand" "=r")
(xor:X (ashift:X (const_int 1)
diff --git a/gcc/testsuite/gcc.target/riscv/zbs-ext.c 
b/gcc/testsuite/gcc.target/riscv/zbs-ext.c
new file mode 100644
index 000..65f42545b5f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/zbs-ext.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gc_zbs -mabi=lp64" } */
+/* { dg-skip-if "" { *-*-* } { "-O0" "-Og" "-O1" } } */
+typedef unsigned long uint64_t;
+typedef unsigned int uint32_t;
+
+uint64_t bclr (const uint32_t i)
+{
+  uint64_t checks = 10;
+  checks &= ~(1U << i);
+  return checks;
+}
+
+/* { dg-final { scan-assembler-times "bclr\t" 1 } } */
+/* { dg-final { scan-assembler-not "sllw\t"} } */


[to-be-committed] [RISC-V] Use bext for extracting a bit into a SImode object

2024-06-09 Thread Jeff Law
bext is defined as (src >> n) & 1.  With that formulation, the "& 1" in 
particular means the result is implicitly zero extended.  So we can 
safely use it on SI objects for rv64 without any explicit extension.
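
A minimal C sketch of those semantics (hedged; the helper name is illustrative, not from the patch):

```c
#include <assert.h>
#include <stdint.h>

/* bext semantics: extract bit n of src.  The final "& 1" zeroes all
   upper bits, so the result is already zero extended regardless of
   the width src started in.  */
uint64_t
bext (uint64_t src, unsigned n)
{
  return (src >> n) & 1;
}
```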


This patch adds the obvious pattern and a few testcases.   I think one 
of the tests is derived from coremark, the other two from spec2017.


This has churned through Ventana's CI system repeatedly since it was 
first written.  Assuming pre-commit CI doesn't complain, I'll commit it 
on Raphael's behalf later today or Monday.



Jeff

gcc/
* config/riscv/bitmanip.md (*bextdisi): New pattern.

gcc/testsuite

* gcc.target/riscv/bext-ext.c: New test.

commit e32599b6c863cffa594ab1eca8f4e11562c4bc6a
Author: Raphael Zinsly 
Date:   Fri Mar 22 16:20:21 2024 -0600

Improvement to bext discovery from Raphael.  Extracted from his "Add Zbs 
extended patterns" MR.

diff --git a/gcc/config/riscv/bitmanip.md b/gcc/config/riscv/bitmanip.md
index 00560be6161..6559d4d6950 100644
--- a/gcc/config/riscv/bitmanip.md
+++ b/gcc/config/riscv/bitmanip.md
@@ -854,6 +854,18 @@ (define_insn_and_split "*bextseqzdisi"
   "operands[1] = gen_lowpart (word_mode, operands[1]);"
   [(set_attr "type" "bitmanip")])
 
+;; The logical-and against 0x1 implicitly extends the result.   So we can treat
+;; an SImode bext as-if it's DImode without any explicit extension.
+(define_insn "*bextdisi"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+(and:DI (subreg:DI (lshiftrt:SI
+(match_operand:SI 1 "register_operand" "r")
+(match_operand:QI 2 "register_operand" "r")) 0)
+(const_int 1)))]
+  "TARGET_64BIT && TARGET_ZBS"
+  "bext\t%0,%1,%2"
+  [(set_attr "type" "bitmanip")])
+
 ;; When performing `(a & (1UL << bitno)) ? 0 : -1` the combiner
 ;; usually has the `bitno` typed as X-mode (i.e. no further
 ;; zero-extension is performed around the bitno).
diff --git a/gcc/testsuite/gcc.target/riscv/bext-ext.c 
b/gcc/testsuite/gcc.target/riscv/bext-ext.c
new file mode 100644
index 000..eeef07d7013
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/bext-ext.c
@@ -0,0 +1,27 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gc_zbs -mabi=lp64" } */
+/* { dg-skip-if "" { *-*-* } { "-O0" "-Og" "-O1" } } */
+typedef unsigned long uint64_t;
+typedef unsigned int uint32_t;
+
+uint64_t bext1 (int dst, const uint32_t i)
+{
+  uint64_t checks = 1U;
+  checks &= dst >> i;
+  return checks;
+}
+
+int bext2 (int dst, int i_denom)
+{
+  dst = 1 & (dst >> i_denom);
+  return dst;
+}
+
+const uint32_t bext3 (uint32_t bit_count, uint32_t symbol)
+{
+  return (symbol >> bit_count) & 1;
+}
+
+/* { dg-final { scan-assembler-times "bext\t" 3 } } */
+/* { dg-final { scan-assembler-not "sllw\t"} } */
+/* { dg-final { scan-assembler-not "srlw\t"} } */


[committed] [RISC-V] Fix false-positive uninitialized variable

2024-06-09 Thread Jeff Law
Andreas noted we were getting an uninit warning after the recent 
constant synthesis changes.  Essentially there's no way for the uninit 
analysis code to know the first entry in the CODES array is an UNKNOWN 
which will set X before its first use.


So trivial initialization with NULL_RTX is the obvious fix.
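
A minimal sketch of the control flow that provokes the warning (hypothetical types and names, not the actual riscv.cc code): x is always assigned on the first iteration because codes[0] is UNKNOWN, but the flow analysis cannot see that invariant.

```c
#include <assert.h>

enum op_code { UNKNOWN, ADD };
struct op { enum op_code code; int value; };

/* codes[0].code is UNKNOWN by construction, so x is set before any
   use -- but nothing in this function proves that to the compiler,
   hence the -Wmaybe-uninitialized false positive.  */
int
synth (const struct op *codes, int n)
{
  int x;                         /* the warned-about variable */
  for (int i = 0; i < n; i++)
    {
      if (codes[i].code == UNKNOWN)
        x = codes[i].value;      /* always taken when i == 0 */
      else
        x = x + codes[i].value;
    }
  return x;
}

const struct op example[] = { { UNKNOWN, 5 }, { ADD, 3 }, { ADD, 2 } };
```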

Pushed to the trunk.

Jeff

commit 932c6f8dd8859afb13475c2de466bd1a159530da
Author: Jeff Law 
Date:   Sun Jun 9 09:17:55 2024 -0600

[committed] [RISC-V] Fix false-positive uninitialized variable

Andreas noted we were getting an uninit warning after the recent constant
synthesis changes.  Essentially there's no way for the uninit analysis code to
know the first entry in the CODES array is an UNKNOWN which will set X before
its first use.

So trivial initialization with NULL_RTX is the obvious fix.

Pushed to the trunk.

gcc/

* config/riscv/riscv.cc (riscv_move_integer): Initialize "x".

diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index 95f3636f8e4..c17141d909a 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -2720,7 +2720,7 @@ riscv_move_integer (rtx temp, rtx dest, HOST_WIDE_INT 
value,
   struct riscv_integer_op codes[RISCV_MAX_INTEGER_OPS];
   machine_mode mode;
   int i, num_ops;
-  rtx x;
+  rtx x = NULL_RTX;
 
   mode = GET_MODE (dest);
   /* We use the original mode for the riscv_build_integer call, because HImode


Re: [to-be-committed] [RISC-V] Use Zbkb for general 64 bit constants when profitable

2024-06-09 Thread Jeff Law




On 6/7/24 11:49 AM, Andreas Schwab wrote:

In file included from ../../gcc/rtl.h:3973,
  from ../../gcc/config/riscv/riscv.cc:31:
In function 'rtx_def* init_rtx_fmt_ee(rtx, machine_mode, rtx, rtx)',
 inlined from 'rtx_def* gen_rtx_fmt_ee_stat(rtx_code, machine_mode, rtx, 
rtx)' at ./genrtl.h:50:26,
 inlined from 'void riscv_move_integer(rtx, rtx, long int, machine_mode)' 
at ../../gcc/config/riscv/riscv.cc:2786:10:
./genrtl.h:37:16: error: 'x' may be used uninitialized 
[-Werror=maybe-uninitialized]
37 |   XEXP (rt, 0) = arg0;
../../gcc/config/riscv/riscv.cc: In function 'void riscv_move_integer(rtx, rtx, 
long int, machine_mode)':
../../gcc/config/riscv/riscv.cc:2723:7: note: 'x' was declared here
  2723 |   rtx x;
   |   ^
cc1plus: all warnings being treated as errors
Thanks.  I guess the change in control flow in there does hide x's state 
pretty well.  It may not even be provable as initialized without knowing 
how this routine interacts with the costing phase that fills in the codes.


I'll take care of it.

Thanks again,
jeff



Re: [PATCH] ifcvt.cc: Prevent excessive if-conversion for conditional moves

2024-06-09 Thread Jeff Law




On 6/9/24 5:28 AM, YunQiang Su wrote:

YunQiang Su  wrote on Sun, Jun 9, 2024, at 18:25:




gcc/ChangeLog:

   * ifcvt.cc (cond_move_process_if_block):
   Consider the result of targetm.noce_conversion_profitable_p()
   when replacing the original sequence with the converted one.

THanks.  I pushed this to the trunk.



Sorry for the delayed report. With this patch the test
gcc.target/mips/movcc-3.c fails.



The problem may be caused by the difference between `seq` and `edge e`.
In `seq`, there may be a compare operation, while
`default_max_noce_ifcvt_seq_cost`
only count the branch operation.

The rtx_cost may consider the compare operation in `seq` as quite expensive.
Overall it sounds like a target issue to me -- ie, now that we're 
testing for profitability instead of just assuming it's profitable some 
targets need adjustment.  Either in their costing model or in the 
testsuite expectations.


Jeff



Re: [PING] [contrib] validate_failures.py: fix python 3.12 escape sequence warnings

2024-06-09 Thread Jeff Law




On 6/9/24 5:45 AM, Gabi Falk wrote:

Hi,

On Sat, Jun 08, 2024 at 03:34:02PM -0600, Jeff Law wrote:

On 5/14/24 8:12 AM, Gabi Falk wrote:

Hi,

This one still needs review:

https://inbox.sourceware.org/gcc-patches/20240415233833.104460-1-gabif...@gmx.com/

I think I just ACK'd an equivalent patch from someone else this week.


Looks like it hasn't been merged yet, and I couldn't find it in the
mailing list archive.
Anyway, I hope either one gets merged soon. :)
I'm sure it will.  The variant I ACK'd was from someone with commit 
privs, so they'll push it to the tree when convenient for them.


jeff



Re: [PATCH v2] Target-independent store forwarding avoidance.

2024-06-09 Thread Jeff Law



On 6/7/24 4:31 PM, Jeff Law wrote:



I've actually added it to my tester just to see if there's any fallout. 
It'll take a week to churn through the long running targets that 
bootstrap in QEMU, but the crosses should have data Monday.
The first round naturally didn't trigger anything because the option is 
off by default.  So I twiddled it to be on at -O1 and above.


epiphany-elf ICEs in gen_rtx_SUBREG with the attached .i file compiled 
with -O2:



root@577c7458c93a://home/jlaw/jenkins/workspace/epiphany-elf/epiphany-elf-obj/newlib/epiphany-elf/newlib/libm/complex#
 epiphany-elf-gcc -O2 libm_a-cacos.i
during RTL pass: avoid_store_forwarding
../../../..//newlib-cygwin/newlib/libm/complex/cacos.c: In function 'cacos':
../../../..//newlib-cygwin/newlib/libm/complex/cacos.c:99:1: internal compiler 
error: in gen_rtx_SUBREG, at emit-rtl.cc:1032
0x614538 gen_rtx_SUBREG(machine_mode, rtx_def*, poly_int<1u, unsigned long>)
../../..//gcc/gcc/emit-rtl.cc:1032
0x614538 gen_rtx_SUBREG(machine_mode, rtx_def*, poly_int<1u, unsigned long>)
../../..//gcc/gcc/emit-rtl.cc:1030
0xe82216 process_forwardings
../../..//gcc/gcc/avoid-store-forwarding.cc:273
0xe82216 avoid_store_forwarding
../../..//gcc/gcc/avoid-store-forwarding.cc:489
0xe82667 execute
../../..//gcc/gcc/avoid-store-forwarding.cc:558



ft32-elf ICE'd in bitmap_check_index at various optimization levels:


FAIL: execute/pr108498-2.c   -O1  (internal compiler error: in 
bitmap_check_index, at sbitmap.h:104)
FAIL: execute/pr108498-2.c   -O1  (test for excess errors)
FAIL: execute/pr108498-2.c   -O2  (internal compiler error: in 
bitmap_check_index, at sbitmap.h:104)
FAIL: execute/pr108498-2.c   -O2  (test for excess errors)
FAIL: execute/pr108498-2.c   -O3 -g  (internal compiler error: in 
bitmap_check_index, at sbitmap.h:104)
FAIL: execute/pr108498-2.c   -O3 -g  (test for excess errors)



avr, c6x,

lm32-elf failed to build libgcc with an ICE in leaf_function_p, I 
haven't isolated that yet.



There were other failures as well.  But you've got a few to start with 
and we can retest pretty easily as the patch evolves.


jeff

# 0 "../../../..//newlib-cygwin/newlib/libm/complex/cacos.c"
# 1 
"//home/jlaw/jenkins/workspace/epiphany-elf/epiphany-elf-obj/newlib/epiphany-elf/newlib//"
# 0 ""
# 0 ""
# 1 "../../../..//newlib-cygwin/newlib/libm/complex/cacos.c"
# 77 "../../../..//newlib-cygwin/newlib/libm/complex/cacos.c"
# 1 
"/home/jlaw/jenkins/workspace/epiphany-elf/newlib-cygwin/newlib/libc/include/complex.h"
 1 3 4
# 15 
"/home/jlaw/jenkins/workspace/epiphany-elf/newlib-cygwin/newlib/libc/include/complex.h"
 3 4
# 1 
"/home/jlaw/jenkins/workspace/epiphany-elf/newlib-cygwin/newlib/libc/include/sys/cdefs.h"
 1 3 4
# 45 
"/home/jlaw/jenkins/workspace/epiphany-elf/newlib-cygwin/newlib/libc/include/sys/cdefs.h"
 3 4
# 1 
"/home/jlaw/jenkins/workspace/epiphany-elf/newlib-cygwin/newlib/libc/include/machine/_default_types.h"
 1 3 4







# 1 
"/home/jlaw/jenkins/workspace/epiphany-elf/newlib-cygwin/newlib/libc/include/sys/features.h"
 1 3 4
# 28 
"/home/jlaw/jenkins/workspace/epiphany-elf/newlib-cygwin/newlib/libc/include/sys/features.h"
 3 4
# 1 "./_newlib_version.h" 1 3 4
# 29 
"/home/jlaw/jenkins/workspace/epiphany-elf/newlib-cygwin/newlib/libc/include/sys/features.h"
 2 3 4
# 9 
"/home/jlaw/jenkins/workspace/epiphany-elf/newlib-cygwin/newlib/libc/include/machine/_default_types.h"
 2 3 4
# 41 
"/home/jlaw/jenkins/workspace/epiphany-elf/newlib-cygwin/newlib/libc/include/machine/_default_types.h"
 3 4

# 41 
"/home/jlaw/jenkins/workspace/epiphany-elf/newlib-cygwin/newlib/libc/include/machine/_default_types.h"
 3 4
typedef signed char __int8_t;

typedef unsigned char __uint8_t;
# 55 
"/home/jlaw/jenkins/workspace/epiphany-elf/newlib-cygwin/newlib/libc/include/machine/_default_types.h"
 3 4
typedef short int __int16_t;

typedef short unsigned int __uint16_t;
# 77 
"/home/jlaw/jenkins/workspace/epiphany-elf/newlib-cygwin/newlib/libc/include/machine/_default_types.h"
 3 4
typedef long int __int32_t;

typedef long unsigned int __uint32_t;
# 103 
"/home/jlaw/jenkins/workspace/epiphany-elf/newlib-cygwin/newlib/libc/include/machine/_default_types.h"
 3 4
typedef long long int __int64_t;

typedef long long unsigned int __uint64_t;
# 134 
"/home/jlaw/jenkins/workspace/epiphany-elf/newlib-cygwin/newlib/libc/include/machine/_default_types.h"
 3 4
typedef signed char __int_least8_t;

typedef unsigned char __uint_least8_t;
# 160 
"/home/jlaw/jenkins/workspace/epiphany-elf/newlib-cygwin/newlib/libc/include/machine/_default_types.h"
 3 4
typedef short int __int_least16_t;

typedef short unsigned int __uint_least16_t;
# 182 
"/home/jlaw/jenkins/workspace/epiphany-elf/newlib-cygwin/newlib/libc/inc

Re: [PATCH] [tree-prof] skip if errors were seen [PR113681]

2024-06-08 Thread Jeff Law




On 4/15/24 10:03 PM, Alexandre Oliva wrote:

On Mar 29, 2024, Alexandre Oliva  wrote:


On Mar 22, 2024, Jeff Law  wrote:

On 3/9/24 2:11 AM, Alexandre Oliva wrote:

ipa_tree_profile asserts that the symtab is in IPA_SSA state, but we
don't reach that state and ICE if e.g. ipa-strub passes report errors.
Skip this pass if errors were seen.
Regstrapped on x86_64-linux-gnu.  Ok to install?

for  gcc/ChangeLog
PR tree-optimization/113681
* tree-profiling.cc (pass_ipa_tree_profile::gate): Skip if
seen_errors.
for  gcc/testsuite/ChangeLog
PR tree-optimization/113681
* c-c++-common/strub-pr113681.c: New.

So I've really never dug into strub, but this would seem to imply that
an error from strub is non-fatal?



Yeah.  I believe that's no different from other passes.



Various other passes have seen_errors guards, but ipa-prof didn't.


Specifically, pass_build_ssa_passes in passes.cc is gated with
!seen_errors(), so we skip all the passes bundled in it, and don't
advance the symtab state to IPA_SSA.  So other passes that would require
IPA_SSA need to be gated similarly.


I suppose the insertion point for the strubm pass was one where others
passes didn't previously issue errors, so that wasn't an issue for
ipa-prof.  But now it is.


The patch needed adjustments to resolve conflicts with unrelated
changes.


[tree-prof] skip if errors were seen [PR113681]

ipa_tree_profile asserts that the symtab is in IPA_SSA state, but we
don't reach that state and ICE if e.g. ipa-strub passes report errors.
Skip this pass if errors were seen.

Regstrapped on x86_64-linux-gnu.  Ok to install?


for  gcc/ChangeLog

PR tree-optimization/113681
* tree-profiling.cc (pass_ipa_tree_profile::gate): Skip if
seen_errors.

for  gcc/testsuite/ChangeLog

PR tree-optimization/113681
* c-c++-common/strub-pr113681.c: New.

OK.
jeff



Re: [middle-end PATCH] Prefer PLUS over IOR in RTL expansion of multi-word shifts/rotates.

2024-06-08 Thread Jeff Law




On 1/18/24 12:54 PM, Roger Sayle wrote:


This patch tweaks RTL expansion of multi-word shifts and rotates to use
PLUS rather than IOR for disjunctive operations.  During expansion of
these operations, the middle-end creates RTL like (X<<C1)|(Y>>C2)
where the constants C1 and C2 guarantee that bits don't overlap.
Hence the IOR can be performed by any any_or_plus operation, such as
IOR, XOR or PLUS; for word-size operations where carry chains aren't
an issue these should all be equally fast (single-cycle) instructions.
The benefit of this change is that targets with shift-and-add insns,
like x86's lea, can benefit from the LSHIFT-ADD form.

An example of a backend that benefits is ARC, which is demonstrated
by these two simple functions:

unsigned long long foo(unsigned long long x) { return x<<2; }

which with -O2 is currently compiled to:

foo:    lsr     r2,r0,30
        asl_s   r1,r1,2
        asl_s   r0,r0,2
        j_s.d   [blink]
        or_s    r1,r1,r2

with this patch becomes:

foo:    lsr     r2,r0,30
        add2    r1,r2,r1
        j_s.d   [blink]
        asl_s   r0,r0,2

unsigned long long bar(unsigned long long x) { return (x<<2)|(x>>62); }

which with -O2 is currently compiled to 6 insns + return:

bar:    lsr     r12,r0,30
        asl_s   r3,r1,2
        asl_s   r0,r0,2
        lsr_s   r1,r1,30
        or_s    r0,r0,r1
        j_s.d   [blink]
        or      r1,r12,r3

with this patch becomes 4 insns + return:

bar:    lsr     r3,r1,30
        lsr     r2,r0,30
        add2    r1,r2,r1
        j_s.d   [blink]
        add2    r0,r3,r0
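
The disjointness argument can be checked directly in C (a hedged sketch mirroring bar above): when the two shifted fields cannot overlap, IOR and PLUS compute the same union of bits, so the expander is free to pick whichever maps to a cheaper instruction.

```c
#include <assert.h>
#include <stdint.h>

/* (x << 2) leaves bits 0..1 zero; (x >> 62) occupies only bits 0..1.
   No bit is set in both operands, so there are no carries, and | and +
   (and ^) all produce the identical result.  */
uint64_t rot2_ior (uint64_t x) { return (x << 2) | (x >> 62); }
uint64_t rot2_add (uint64_t x) { return (x << 2) + (x >> 62); }
```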


This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?


2024-01-18  Roger Sayle  

gcc/ChangeLog
 * expmed.cc (expand_shift_1): Use add_optab instead of ior_optab
 to generate PLUS instead or IOR when unioning disjoint bitfields.
 * optabs.cc (expand_subword_shift): Likewise.
 (expand_binop): Likewise for double-word rotate.
Also note that on some targets like RISC-V, there's more freedom to 
generate compressed instructions from "add" rather than "or".


Anyway, given the time elapsed since submission, I went ahead and 
retested on x86, then committed & pushed to the trunk.


Thanks!

jeff



Re: [RFC/RFA] [PATCH 08/12] Add a new pass for naive CRC loops detection

2024-06-08 Thread Jeff Law




On 5/29/24 5:12 AM, Mariam Arutunian wrote:



IIRC we looked at the problem of canonicalizing the loop into a form
where we didn't necessarily have conditional blocks, instead we had
branchless sequences for the conditional xor and dealing with the high
bit in the crc.  My recollection was that the coremark CRC loop would
always canonicalize, but that in general we still saw multiple CRC
implementations that did not canonicalize and thus we still needed the
more complex matching.  Correct?


The loop in CoreMark is not fully canonicalized in that form,
as there are still branches present for the conditional XOR operation.
I checked that using the -O2 and -O3 flags.
A bit of a surprise.  Though it may be the case that some of the 
canonicalization steps are happening later in the pipeline.  No worries 
as I think we'd already concluded that we'd see at least some CRC 
implementations that wouldn't canonicalize down to branchless sequences 
for the conditional xor.
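
For reference, a typical loop of the kind under discussion (a hedged sketch in the style of CoreMark's crcu8, not code from the patch): the conditional xor is written with an explicit branch, which may or may not be if-converted before the detection pass runs.

```c
#include <assert.h>
#include <stdint.h>

/* Bit-at-a-time CRC-16 (reflected polynomial 0xA001): one PHI for
   crc, one for data, and a branchy conditional xor in the loop.  */
uint16_t
crc16_update (uint16_t crc, uint8_t data)
{
  crc ^= data;
  for (int i = 0; i < 8; i++)
    {
      if (crc & 1)                     /* the conditional xor */
        crc = (crc >> 1) ^ 0xA001;
      else
        crc >>= 1;
    }
  return crc;
}
```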






 > +
 > +gimple *
 > +crc_optimization::find_shift_after_xor (tree xored_crc)
 > +{
 > +  imm_use_iterator imm_iter;
 > +  use_operand_p use_p;
 > +
 > +  if (TREE_CODE (xored_crc) != SSA_NAME)
 > +    return nullptr;
If we always expect XORED_CRC to be an SSA_NAME, we might be able to use
gcc_assert (TREE_CODE (XORED_CRC) == SSA_NAME);

I'm not sure that it always has to be an SSA_NAME.

For a logical operation like XOR it should always have the form

SSA_NAME = SSA_NAME ^ (SSA_NAME | CONSTANT)

The constant might be a vector  constant, but the basic form won't 
change.  It's one of the nicer properties of gimple.  In contrast RTL 
would allow a variety of lvalues and rvalues, including MEMs, REGs, 
SUBREGs, extensions, other binary ops, etc etc.




 > +
 > +/* Set M_PHI_FOR_CRC and M_PHI_FOR_DATA fields.
 > +   Returns false if there are more than two (as in CRC
calculation only CRC's
 > +   and data's phi may exist) or no phi statements in STMTS (at
least there must
 > +   be CRC's phi).
 > +   Otherwise, returns true.  */
 > +
 > +bool
 > +crc_optimization::set_crc_and_data_phi (auto_vec<gimple *> &stmts)
 > +{
 > +  for (auto stmt_it = stmts.begin (); stmt_it != stmts.end ();
stmt_it++)
 > +    {
 > +      if (is_a <gphi *> (*stmt_it) && bb_loop_header_p (gimple_bb (*stmt_it)))
 > +     {
 > +       if (!m_phi_for_crc)
 > +         m_phi_for_crc = as_a <gphi *> (*stmt_it);
 > +       else if (!m_phi_for_data)
 > +         m_phi_for_data = as_a <gphi *> (*stmt_it);
 > +       else
 > +         {
 > +           if (dump_file && (dump_flags & TDF_DETAILS))
 > +             fprintf (dump_file, "Xor-ed variable depends on
more than 2 "
 > +                                 "phis.\n");
 > +           return false;
 > +         }
 > +     }
 > +    }
 > +  return m_phi_for_crc;
Hmm.  For a given PHI, how do we know if it's for the data item or the
crc item, or something else (like a loop counter) entirely?



I trace the def-use chain upwards from the XOR statement to determine 
which PHI node corresponds to CRC and data.
Since we assume the loop calculates CRC, I expect only variables 
representing data and CRC to participate in these operations.
In the implementations I support, the loop counter is used only for the 
iteration.
Any misidentification of CRC and data would occur only if the loop 
doesn't calculate CRC, in which case next checks would fail, leading the 
algorithm to identify it as not CRC.


Here, the PHI nodes for CRC and data might be mixed in places.
I just assume that the first found PHI is CRC, second data.
I correctly determine them later with the swap_crc_and_data_if_needed 
function.

Ah, OK.  That probably deserves a comment in this code.


jeff


Re: [RFC/RFA] [PATCH 01/12] Implement internal functions for efficient CRC computation

2024-06-08 Thread Jeff Law




On 5/27/24 7:51 AM, Mariam Arutunian wrote:



I carefully reviewed the indentation of the code using different editors 
and viewers, and everything appeared correct.
I double-checked the specific sections mentioned, and they also looked 
right.

In this reply message I see that it's not correct. I'll try to fix it.

Thanks for double-checking.  It's one of the downsides of email based flows.

Jeff


Re: [RFC/RFA] [PATCH 08/12] Add a new pass for naive CRC loops detection

2024-06-08 Thread Jeff Law




On 6/4/24 7:41 AM, Mariam Arutunian wrote:
Mariam, your thoughts on whether or not those two phases could handle a 
loop with two CRC calculations inside, essentially creating two calls to 
our new builtins?

It is feasible, but it would likely demand considerable effort and 
additional work to implement effectively.
Thanks for the confirmation.  I suspect it likely doesn't come up often 
in practice either.






The key would be to only simulate the use-def cycle from the loop-closed PHI 
(plus the loop control of course, but niter/SCEV should be enough there) and 
just replace that LC PHI, leaving loop DCE to DCE.


Thank you, this is a good idea to just replace the PHI and leave the loop to 
DCE to remove only single CRC parts.
It does seem like replacing the PHI when we have an optimizable case 
might simplify that aspect of the implementation.





The current pass only verifies cases where a single CRC calculation is 
performed within the loop. During the verification phase,
I ensure that there are no other calculations aside from those necessary for 
the considered CRC computation.

Also, when I was investigating the bitwise CRC implementations used in 
different software, in all cases the loop was calculating just one CRC and no 
other calculations were done.
Thus, in almost all cases, the first phase will filter out non-CRCs, and during 
the second phase, only real CRCs with no other calculations will be executed.
This ensures that unnecessary statements won't be executed in most cases.
But we may have had a degree of sampling bias here.  If I remember 
correctly I used the initial filtering pass as the "trigger" to report a 
potential CRC case.  If that initial filtering pass rejected cases with 
other calculations in the loop, then we never would have seen those.




Leaving the loop to DCE will simplify the process of removing parts connected 
to a single CRC calculation.
However, since now we detect a loop that only calculates a single CRC, we can 
entirely remove it at this stage without additional checks.
Let's evaluate this option as we get to the later patches in the series. 
 What I like about Richard's suggestion is that it "just works" and it 
will continue to work, even as the overall infrastructure changes.  In 
contrast a bespoke loop removal implementation in a specific pass may 
need adjustment if other aspects of our infrastructure change.






If we really want a separate pass (or utility to work on a single 
loop) then we might consider moving some of the final value replacement 
code that doesn’t work with only SCEV there as well. There’s also 
special code in loop distribution for strlen recognition now, not 
exactly fitting in.



Note I had patches to do final value replacement on demand from CD-DCE when it 
figures a loop has no side effects besides of its reduction outputs (still want 
to pick this up at some point again).


Oh, this could provide useful insights for our implementation.
Are you thinking of reusing that on-demand analysis to reduce the set of 
loops we analyze?


Jeff



Re: [PING] [contrib] validate_failures.py: fix python 3.12 escape sequence warnings

2024-06-08 Thread Jeff Law




On 5/14/24 8:12 AM, Gabi Falk wrote:

Hi,

This one still needs review:

https://inbox.sourceware.org/gcc-patches/20240415233833.104460-1-gabif...@gmx.com/

I think I just ACK'd an equivalent patch from someone else this week.

jeff



Re: [PATCH 1/5] RISC-V: Remove float vector eqne pattern

2024-06-08 Thread Jeff Law




On 3/1/24 1:12 AM, Demin Han wrote:

Hi juzhe,

I also thought it’s related to commutive firstly.

Following things make me to do the removal:

1.No tests fails in regression

2.When I write if (a == 2) and if (2 == a), the results are same

GCC canonicalizes comparisons so that constants appear second.
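
A quick illustration of the point (a hedged sketch): both spellings reach the optimizers in the same canonical form with the constant second, so they behave and compile identically.

```c
#include <assert.h>

/* "a == 2" and "2 == a" are canonicalized to the same comparison,
   constant as the second operand.  */
int cmp_a (int a) { return a == 2; }
int cmp_b (int a) { return 2 == a; }
```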

Jeff


Re: [PATCH 1/5] RISC-V: Remove float vector eqne pattern

2024-06-08 Thread Jeff Law




On 2/29/24 11:27 PM, demin.han wrote:

We can unify eqne and other comparison operations.

Tested on RV32 and RV64

gcc/ChangeLog:

* config/riscv/riscv-vector-builtins-bases.cc: Remove eqne cond
* config/riscv/vector.md (@pred_eqne_scalar): Remove patterns
(*pred_eqne_scalar_merge_tie_mask): Ditto
(*pred_eqne_scalar): Ditto
(*pred_eqne_scalar_narrow): Ditto
So I'll tentatively ACK this for the trunk, assuming Robin doesn't 
object before Tuesday's patchwork meeting.


jeff




Re: [PATCH 1/5] RISC-V: Remove float vector eqne pattern

2024-06-08 Thread Jeff Law




On 5/16/24 1:21 PM, Robin Dapp wrote:

Can eqne pattern removal patches be committed firstly?


Please first make sure you test with corner cases, NaNs in
particular.  I'm pretty sure we don't have any test cases for
those.

But isn't canonicalization of EQ/NE safe, even for IEEE NaN and +-0.0?

target = (a == b) ? x : y
target = (a != b) ? y : x

Are equivalent, even for IEEE IIRC.
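
The equivalence holds even for NaN and signed zeros, which a small C check illustrates (a hedged sketch, not from the patch):

```c
#include <assert.h>
#include <math.h>

/* (a == b) ? x : y  and  (a != b) ? y : x select the same value for
   every input: with a NaN operand, == is false and != is true, so
   both forms pick y; 0.0 == -0.0 is true, so both forms pick x.  */
double sel_eq (double a, double b, double x, double y) { return (a == b) ? x : y; }
double sel_ne (double a, double b, double x, double y) { return (a != b) ? y : x; }
```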

jeff




Re: [PATCH 0/2] fix RISC-V zcmp popretz [PR113715]

2024-06-08 Thread Jeff Law




On 6/5/24 8:42 PM, Fei Gao wrote:


But let's back up and get a good explanation of what the problem is.
Based on patch 2/2 it looks like we have lost an assignment to the
return register.

To someone not familiar with this code, it sounds to me like we've made
a mistake earlier and we're now defining a hook that lets us go back and
fix that earlier mistake.   I'm probably wrong, but so far that's what
it sounds like.

Hi Jeff

You're right. Let me rephrase patch 2/2 with more details. Search for
"/*feigao" to locate the point I'm trying to explain.

code snippets from gcc/function.cc
void
thread_prologue_and_epilogue_insns (void)
{
...
   /*feigao:
         targetm.gen_epilogue () is called here to generate epilogue sequence.

https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=b27d323a368033f0b37e93c57a57a35fd9997864
Commit above tries in targetm.gen_epilogue () to detect if
there's a li a0,0 insn at the end of the insn chain; if so, cm.popret
is replaced by cm.popretz and the li a0,0 insn is deleted.
So that seems like the critical issue.  Generation of the 
prologue/epilogue really shouldn't be changing other instructions in the 
instruction stream.  I'm not immediately aware of another target that 
does that, and it seems like a rather risky thing to do.



It looks like the cm.popretz's RTL exposes the assignment to a0 and 
there's a DCE pass that runs after insertion of the prologue/epilogue. 
So I would suggest leaving the assignment to a0 in the RTL chain and see 
if the later DCE pass after prologue generation eliminates the redundant 
assignment.  That seems a lot cleaner.




Jeff


Re: [PATCH] haifa-sched: Avoid the fusion priority of the fused insn to affect the subsequent insn sequence.

2024-06-08 Thread Jeff Law




On 6/6/24 8:51 PM, Jin Ma wrote:


I am very sorry that I did not check the commit information carefully. The 
statement is somewhat inaccurate.


When insns 1 and 2, and 3 and 4, can be fused, there is the
following sequence:

;;    insn |
;;      1  | sp=sp-0x18
;;  +   2  | [sp+0x10]=ra
;;      3  | [sp+0x8]=s0
;;      4  | [sp+0x0]=s1



The fusion priority of insns 2, 3, and 4 is the same. According to
the current algorithm, since abs(0x10-0x8)


;;    insn |
;;      1  | sp=sp-0x18
;;  +   2  | [sp+0x10]=ra
;;      4  | [sp+0x8]=s1
;;  +   3  | [sp+0x0]=s0

gcc/ChangeLog:



  * haifa-sched.cc (rank_for_schedule): Likewise.


When insns 1 and 2, and 4 and 3, can be fused, there is the
following sequence:

;;    insn |
;;      1  | sp=sp-0x18
;;  +   2  | [sp+0x10]=ra
;;      3  | [sp+0x8]=s0
;;      4  | [sp+0x0]=s1

The fusion priority of insns 2, 3, and 4 is the same. According to
the current algorithm, since abs(0x10-0x8)

I'd really love to see a testcase here, particularly since I'm still 
having trouble understanding the code you're currently getting vs the 
code you want.


Furthermore, I think I need to understand the end motivation here.  I 
always think of fusion priority as bringing insns consecutive so that 
the peephole pass can then squash two or more insns into a single insn. 
The canonical case being load/store pairs.



If you're trying to generate pairs, then that's fine.  I just want to 
make sure I understand the goal.  And if you're trying to generate pairs 
what actually can be paired?  I must admit I don't have any notable 
experience with the thead core extensions.


If you're just trying to keep the instructions consecutive in the IL, 
then I don't think fusion priorities are a significant concern.  Much 
more important for that case is the fusion pair detection (which I think 
is about to get a lot more attention in the near future).


Jeff



Re: Reverted recent patches to resource.cc

2024-06-08 Thread Jeff Law




On 5/29/24 8:07 PM, Jeff Law wrote:



On 5/29/24 7:28 PM, Hans-Peter Nilsson wrote:

From: Hans-Peter Nilsson 
Date: Mon, 27 May 2024 19:51:47 +0200



2: Does not depend on 1, but corrects an incidentally found wart:
find_basic_block calls fail too often.  Replace it with "modern"
insn-to-basic-block cross-referencing.

3: Just an addendum to 2: removes an "if", where the condition is now
always-true, dominated by a gcc_assert, and where the change in
indentation was too ugly.

4: Corrects another incidentally found wart: for the last 15 years the
code in resource.cc has only been called from within reorg.cc (and
reorg.c), specifically not possibly before calling init_resource_info
or after free_resource_info, so we can discard the code that tests
certain allocated arrays for NULL.  I didn't even bother with a
gcc_assert; besides some gen*-generated files, only reorg.cc includes
resource.h (not to be confused with the system sys/resource.h).
A grep says the #include resource.h can be removed from those gen*
files and presumably from RESOURCE_H(!) as well.  Some Other Time.
Also, removed a redundant "if (tinfo != NULL)" and moved the then-code
into the previous then-clause.

   resource.cc: Replace calls to find_basic_block with cfgrtl
 BLOCK_FOR_INSN
   resource.cc (mark_target_live_regs): Remove check for bb not found
   resource.cc: Remove redundant conditionals


I had to revert those last three patches due to PR
bootstrap/115284.  I hope to revisit once I have a means to
reproduce (and fix) the underlying bug.  It doesn't have to
be a bug with those changes per-se: IMHO the "improved"
lifetimes could just as well have uncovered a bug elsewhere
in reorg.  It's still on me to resolve that situation; done.
I'm just glad the cause was the incidental improvements and
not the original bug I wanted to fix.

There appears to be only a single supported SPARC machine in
cfarm: cfarm216, and I currently can't reach it due to what
appears to be issues at my end.  I guess I'll either fix
that or breathe life into sparc-elf+sim.

Or if you've got a reasonable server to use, QEMU might save you :-)



Even better option: the sh4/sh4eb-linux-gnu ports with the 
execute/ieee/fp-cmp-5.c test.  That started failing at execution at -O2 
with the first patch in the series and there are very clear assembly 
differences before/after your change.  Meaning you can probably look at 
them with just a cross compile and compare the before/after.



Jeff


Re: [RFC/RFA] [PATCH 03/12] RISC-V: Add CRC expander to generate faster CRC.

2024-06-08 Thread Jeff Law




On 6/8/24 1:53 AM, Richard Sandiford wrote:



I realise there are many ways of writing this out there though,
so that's just a suggestion.  (And only lightly tested.)

FWIW, we could easily extend the interface to work on wide_ints if we
ever need it for N>63.
I think there's constraints elsewhere that keep us in the N<=63 range. 
If we extended things elsewhere to include TI then we could fully 
support 64bit CRCs.


I don't *think* it's that hard, but we haven't actually tried.

Jeff




Re: [PATCH v2 3/3] RISC-V: Add Zalrsc amo-op patterns

2024-06-07 Thread Jeff Law




On 6/3/24 3:53 PM, Patrick O'Neill wrote:

All amo patterns can be represented with lrsc sequences.
Add these patterns as a fallback when Zaamo is not enabled.

gcc/ChangeLog:

* config/riscv/sync.md (atomic_): New expand 
pattern.
(amo_atomic_): Rename amo pattern.
(atomic_fetch_): New lrsc sequence pattern.
(lrsc_atomic_): New expand pattern.
(amo_atomic_fetch_): Rename amo pattern.
(lrsc_atomic_fetch_): New lrsc sequence pattern.
(atomic_exchange): New expand pattern.
(amo_atomic_exchange): Rename amo pattern.
(lrsc_atomic_exchange): New lrsc sequence pattern.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/amo-zaamo-preferred-over-zalrsc.c: New test.
* gcc.target/riscv/amo-zalrsc-amo-add-1.c: New test.
* gcc.target/riscv/amo-zalrsc-amo-add-2.c: New test.
* gcc.target/riscv/amo-zalrsc-amo-add-3.c: New test.
* gcc.target/riscv/amo-zalrsc-amo-add-4.c: New test.
* gcc.target/riscv/amo-zalrsc-amo-add-5.c: New test.

Signed-off-by: Patrick O'Neill 
--
rv64imfdc_zalrsc has the same testsuite results as rv64imafdc after this
patch is applied.
---
AFAIK there isn't a way to subtract an extension similar to dg-add-options.
As a result I needed to specify a -march string for
amo-zaamo-preferred-over-zalrsc.c instead of using testsuite infra.

I believe you are correct.




diff --git a/gcc/testsuite/gcc.target/riscv/amo-zaamo-preferred-over-zalrsc.c 
b/gcc/testsuite/gcc.target/riscv/amo-zaamo-preferred-over-zalrsc.c
new file mode 100644
index 000..1c124c2b8b1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/amo-zaamo-preferred-over-zalrsc.c

[ ... ]
Not a big fan of the function-bodies tests.  If we're going to use them, 
we need to be especially careful about requiring specific registers so 
that we're not stuck adjusting them all the time due to changes in the 
regsiter allocator, optimizers, etc.



diff --git a/gcc/testsuite/gcc.target/riscv/amo-zalrsc-amo-add-1.c 
b/gcc/testsuite/gcc.target/riscv/amo-zalrsc-amo-add-1.c
new file mode 100644
index 000..3cd6ce04830
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/amo-zalrsc-amo-add-1.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* Verify that lrsc atomic op mappings match Table A.6's recommended mapping.  
*/
+/* { dg-options "-O3 -march=rv64id_zalrsc" } */
+/* { dg-skip-if "" { *-*-* } { "-g" "-flto"} } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+/*
+** foo:
+** 1:
+** lr.w\ta5, 0\(a0\)
+** add\ta5, a5, a1
+** sc.w\ta5, a5, 0\(a0\)
+**  bnez\ta5, 1b
+** ret
+*/
+void foo (int* bar, int* baz)
+{
+  __atomic_add_fetch(bar, baz, __ATOMIC_RELAXED);
+}
This one is a good example.  We could just as easily use a variety of 
registers other than a5 for the temporary.


Obviously for registers that hold the incoming argument or an outgoing 
result, we can be more strict.


If you could take a look at the added tests and generalize the registers 
it'd be appreciated.  OK with that adjustment.


jeff




Re: [PATCH v2 2/3] RISC-V: Add Zalrsc and Zaamo testsuite support

2024-06-07 Thread Jeff Law




On 6/3/24 3:53 PM, Patrick O'Neill wrote:

Convert testsuite infrastructure to use Zalrsc and Zaamo rather than A.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/amo-table-a-6-amo-add-1.c: Use Zaamo rather than A.
* gcc.target/riscv/amo-table-a-6-amo-add-2.c: Ditto.
* gcc.target/riscv/amo-table-a-6-amo-add-3.c: Ditto.
* gcc.target/riscv/amo-table-a-6-amo-add-4.c: Ditto.
* gcc.target/riscv/amo-table-a-6-amo-add-5.c: Ditto.
* gcc.target/riscv/amo-table-a-6-compare-exchange-1.c: Use Zalrsc rather
than A.
* gcc.target/riscv/amo-table-a-6-compare-exchange-2.c: Ditto.
* gcc.target/riscv/amo-table-a-6-compare-exchange-3.c: Ditto.
* gcc.target/riscv/amo-table-a-6-compare-exchange-4.c: Ditto.
* gcc.target/riscv/amo-table-a-6-compare-exchange-5.c: Ditto.
* gcc.target/riscv/amo-table-a-6-compare-exchange-6.c: Ditto.
* gcc.target/riscv/amo-table-a-6-compare-exchange-7.c: Ditto.
* gcc.target/riscv/amo-table-a-6-subword-amo-add-1.c: Use Zaamo rather
than A.
* gcc.target/riscv/amo-table-a-6-subword-amo-add-2.c: Ditto.
* gcc.target/riscv/amo-table-a-6-subword-amo-add-3.c: Ditto.
* gcc.target/riscv/amo-table-a-6-subword-amo-add-4.c: Ditto.
* gcc.target/riscv/amo-table-a-6-subword-amo-add-5.c: Ditto.
* gcc.target/riscv/amo-table-ztso-amo-add-1.c: Add Zaamo option.
* gcc.target/riscv/amo-table-ztso-amo-add-2.c: Ditto.
* gcc.target/riscv/amo-table-ztso-amo-add-3.c: Ditto.
* gcc.target/riscv/amo-table-ztso-amo-add-4.c: Ditto.
* gcc.target/riscv/amo-table-ztso-amo-add-5.c: Ditto.
* gcc.target/riscv/amo-table-ztso-compare-exchange-1.c: Use Zalrsc 
rather
than A.
* gcc.target/riscv/amo-table-ztso-compare-exchange-2.c: Ditto.
* gcc.target/riscv/amo-table-ztso-compare-exchange-3.c: Ditto.
* gcc.target/riscv/amo-table-ztso-compare-exchange-4.c: Ditto.
* gcc.target/riscv/amo-table-ztso-compare-exchange-5.c: Ditto.
* gcc.target/riscv/amo-table-ztso-compare-exchange-6.c: Ditto.
* gcc.target/riscv/amo-table-ztso-compare-exchange-7.c: Ditto.
* gcc.target/riscv/amo-table-ztso-subword-amo-add-1.c: Ditto.
* gcc.target/riscv/amo-table-ztso-subword-amo-add-2.c: Ditto.
* gcc.target/riscv/amo-table-ztso-subword-amo-add-3.c: Ditto.
* gcc.target/riscv/amo-table-ztso-subword-amo-add-4.c: Ditto.
* gcc.target/riscv/amo-table-ztso-subword-amo-add-5.c: Ditto.
* lib/target-supports.exp: Add testsuite infrastructure support for
Zaamo and Zalrsc.
So there's a lot of whitespace changes going on in target-supports.exp 
that make it harder to find the real changes.


There's always a bit of a judgement call for that kind of thing.  This 
one probably goes past what I'd generally recommend, meaning that the 
formatting stuff would normally be a separate patch.


A reasonable starting point would be if you're not changing the function 
in question, then fixing formatting in it probably should be a distinct 
patch.


You probably should update the docs in sourcebuild.texi for the new 
target-supports tests.


So OK for the trunk (including the whitespace fixes) with a suitable 
change to sourcebuild.texi.


jeff


Re: [PATCH v2 1/3] RISC-V: Add basic Zaamo and Zalrsc support

2024-06-07 Thread Jeff Law




On 6/3/24 3:53 PM, Patrick O'Neill wrote:

The A extension has been split into two parts: Zaamo and Zalrsc.
This patch adds basic support by making the A extension imply Zaamo and
Zalrsc.

Zaamo/Zalrsc spec: https://github.com/riscv/riscv-zaamo-zalrsc/tags
Ratification: https://jira.riscv.org/browse/RVS-1995

gcc/ChangeLog:

* common/config/riscv/riscv-common.cc: Add Zaamo and Zalrsc.
* config/riscv/arch-canonicalize: Make A imply Zaamo and Zalrsc.
* config/riscv/riscv.opt: Add Zaamo and Zalrsc
* config/riscv/sync.md: Convert TARGET_ATOMIC to TARGET_ZAAMO and
TARGET_ZALRSC.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/attribute-15.c: Adjust expected arch string.
* gcc.target/riscv/attribute-16.c: Ditto.
* gcc.target/riscv/attribute-17.c: Ditto.
* gcc.target/riscv/attribute-18.c: Ditto.
* gcc.target/riscv/pr110696.c: Ditto.
* gcc.target/riscv/rvv/base/pr114352-1.c: Ditto.
* gcc.target/riscv/rvv/base/pr114352-3.c: Ditto.

OK
jeff



Re: [PATCH v2] Target-independent store forwarding avoidance.

2024-06-07 Thread Jeff Law




On 6/6/24 4:10 AM, Manolis Tsamis wrote:

This pass detects cases of expensive store forwarding and tries to avoid them
by reordering the stores and using suitable bit insertion sequences.
For example it can transform this:

  strb w2, [x1, 1]
  ldr  x0, [x1]  # Expensive store forwarding to larger load.

To:

  ldr  x0, [x1]
  strb w2, [x1]
  bfi  x0, x2, 0, 8

Assembly like this can appear with bitfields or type punning / unions.
On stress-ng when running the cpu-union microbenchmark the following speedups
have been observed.

   Neoverse-N1:  +29.4%
   Intel Coffeelake: +13.1%
   AMD 5950X:+17.5%

gcc/ChangeLog:

* Makefile.in: Add avoid-store-forwarding.o.
* common.opt: New option -favoid-store-forwarding.
* params.opt: New param store-forwarding-max-distance.
* doc/invoke.texi: Document new pass.
* doc/passes.texi: Document new pass.
* passes.def: Schedule a new pass.
* tree-pass.h (make_pass_rtl_avoid_store_forwarding): Declare.
* avoid-store-forwarding.cc: New file.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/avoid-store-forwarding-1.c: New test.
* gcc.target/aarch64/avoid-store-forwarding-2.c: New test.
* gcc.target/aarch64/avoid-store-forwarding-3.c: New test.
* gcc.target/aarch64/avoid-store-forwarding-4.c: New test.
So this is getting a lot more interesting.  I think the first time I 
looked at this it was more concerned with stores feeding something like 
a load-pair and avoiding the store forwarding penalty for that case.  Am 
I mis-remembering, or did it get significantly more general?







+
+static unsigned int stats_sf_detected = 0;
+static unsigned int stats_sf_avoided = 0;
+
+static rtx
+get_load_mem (rtx expr)
Needs a function comment.  You should probably mention that EXPR must be 
a single_set in that comment.




 +

+  rtx dest;
+  if (eliminate_load)
+dest = gen_reg_rtx (load_inner_mode);
+  else
+dest = SET_DEST (load);
+
+  int move_to_front = -1;
+  int total_cost = 0;
+
+  /* Check if we can emit bit insert instructions for all forwarded stores.  */
+  FOR_EACH_VEC_ELT (stores, i, it)
+{
+  it->mov_reg = gen_reg_rtx (GET_MODE (it->store_mem));
+  rtx_insn *insns = NULL;
+
+  /* If we're eliminating the load then find the store with zero offset
+and use it as the base register to avoid a bit insert.  */
+  if (eliminate_load && it->offset == 0)
So how often is this triggering?  We have various bits of code in the 
gimple optimizers to detect a store followed by a load from the same 
address and do the forwarding.  If those cases are happening with any 
frequency, that would be a sign the code in DOM and elsewhere isn't 
working well.


The way these passes detect this case is to take the store, flip the 
operands around (i.e., it looks like a load) and enter that into the 
expression hash tables.  After that, standard redundancy elimination 
approaches will work.




+   {
+ start_sequence ();
+
+ /* We can use a paradoxical subreg to force this to a wider mode, as
+the only use will be inserting the bits (i.e., we don't care about
+the value of the higher bits).  */
Which may be a good hint about the cases you're capturing -- if the 
modes/sizes differ that would make more sense since I don't think we're 
as likely to be capturing those cases.




diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 4e8967fd8ab..c769744d178 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -12657,6 +12657,15 @@ loop unrolling.
  This option is enabled by default at optimization levels @option{-O1},
  @option{-O2}, @option{-O3}, @option{-Os}.
  
+@opindex favoid-store-forwarding

+@item -favoid-store-forwarding
+@itemx -fno-avoid-store-forwarding
+Many CPUs will stall for many cycles when a load partially depends on previous
+smaller stores.  This pass tries to detect such cases and avoid the penalty by
+changing the order of the load and store and then fixing up the loaded value.
+
+Disabled by default.
Is there any particular reason why this would be off by default at -O1 
or higher?  It would seem to me that on modern cores this 
transformation should easily be a win.  Even on an old in-order core, 
avoiding the load with the bit insert is likely profitable, just not as 
much so.






diff --git a/gcc/params.opt b/gcc/params.opt
index d34ef545bf0..b8115f5c27a 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -1032,6 +1032,10 @@ Allow the store merging pass to introduce unaligned 
stores if it is legal to do
  Common Joined UInteger Var(param_store_merging_max_size) Init(65536) 
IntegerRange(1, 65536) Param Optimization
  Maximum size of a single store merging region in bytes.
  
+-param=store-forwarding-max-distance=

+Common Joined UInteger Var(param_store_forwarding_max_distance) Init(10) 
IntegerRange(1, 1000) Param Optimization
+Maximum number of 

Re: [PATCH] expmed: TRUNCATE value1 if needed in store_bit_field_using_insv

2024-06-05 Thread Jeff Law




On 6/5/24 8:57 AM, YunQiang Su wrote:

Richard Sandiford wrote on Wed, Jun 5, 2024 at 22:14:


YunQiang Su  writes:

PR target/113179.

In `store_bit_field_using_insv`, we just use SUBREG if value_mode
>= op_mode, while in some ports a sign_extend will be needed,
such as MIPS64:
   If either GPR rs or GPR rt does not contain sign-extended 32-bit
   values (bits 63..31 equal), then the result of the operation is
   UNPREDICTABLE.

The problem happens for the code like:
   struct xx {
 int a:4;
 int b:24;
 int c:3;
 int d:1;
   };

   void xx (struct xx *a, long long b) {
 a->d = b;
   }

In the above code, the hard register containing `b` may not be
sign-extended.

gcc/
   PR target/113179
   * expmed.c(store_bit_field_using_insv): TRUNCATE value1 if
   needed.

gcc/testsuite
   PR target/113179
   * gcc.target/mips/pr113179.c: New tests.
---
  gcc/expmed.cc| 12 +---
  gcc/testsuite/gcc.target/mips/pr113179.c | 18 ++
  2 files changed, 27 insertions(+), 3 deletions(-)
  create mode 100644 gcc/testsuite/gcc.target/mips/pr113179.c

diff --git a/gcc/expmed.cc b/gcc/expmed.cc
index 4ec035e4843..6a582593da8 100644
--- a/gcc/expmed.cc
+++ b/gcc/expmed.cc
@@ -704,9 +704,15 @@ store_bit_field_using_insv (const extraction_insn *insv, 
rtx op0,
   }
 else
   {
-   tmp = gen_lowpart_if_possible (op_mode, value1);
-   if (! tmp)
- tmp = gen_lowpart (op_mode, force_reg (value_mode, value1));
+   if (targetm.mode_rep_extended (op_mode, value_mode))
+ tmp = simplify_gen_unary (TRUNCATE, op_mode,
+   value1, value_mode);
+   else
+ {
+   tmp = gen_lowpart_if_possible (op_mode, value1);
+   if (! tmp)
+ tmp = gen_lowpart (op_mode, force_reg (value_mode, value1));
+ }
   }
 value1 = tmp;
   }


I notice this patch is already applied.  Was it approved?  I didn't
see an approval in my feed or in the archives.



Sorry.  I assumed that it only affects MIPS targets since only MIPS defines
   targetm.mode_rep_extended
Just a note for the future, even if something is guarded by a target 
hook that's only defined by a single target we'd want to see an ACK if 
the code is in a target independent file.


Jeff


Re: [V2 PATCH] Simplify (AND (ASHIFTRT A imm) mask) to (LSHIFTRT A imm) for vector mode.

2024-06-05 Thread Jeff Law




On 6/4/24 10:22 PM, liuhongt wrote:

Can you add a testcase for this?  I don't mind if it's x86 specific and
does a bit of asm scanning.

Also note that the context for this patch has changed, so it won't
automatically apply.  So be extra careful when updating so that it goes
into the right place (all the more reason to have a testcase validating
that the optimization works correctly).


I think the patch itself is fine.  So further review is just for the
testcase and should be easy.

rebased and add a testcase.

Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?


When mask is (1 << (prec - imm) - 1) which is used to clear upper bits
of A, then it can be simplified to LSHIFTRT.

i.e. Simplify
(and:v8hi
   (ashiftrt:v8hi A 8)
   (const_vector 0xff x8))
to
(lshiftrt:v8hi A 8)

gcc/ChangeLog:

PR target/114428
* simplify-rtx.cc
(simplify_context::simplify_binary_operation_1):
Simplify (AND (ASHIFTRT A imm) mask) to (LSHIFTRT A imm) for
specific mask.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr114428-1.c: New test.

OK.

Being x264 related, I took a quick glance at RISC-V before/after and it 
seems to be slightly better as well.


Jeff


Re: [PATCH] Record edge true/false value for gcov

2024-06-05 Thread Jeff Law




On 6/4/24 6:26 AM, Jørgen Kvalsvik wrote:

Make gcov aware of which edges are the true/false edges, to more
accurately reconstruct the CFG.  There are plenty of bits left in
arc_info and it opens up richer reporting.

gcc/ChangeLog:

* gcov-io.h (GCOV_ARC_TRUE): New.
(GCOV_ARC_FALSE): New.
* gcov.cc (struct arc_info): Add true_value, false_value.
(read_graph_file): Read true_value, false_value.
Going to trust you'll find this useful in the near future :-)  So OK for 
the trunk.


jeff



Re: [PATCH] expmed: TRUNCATE value1 if needed in store_bit_field_using_insv

2024-06-05 Thread Jeff Law




On 6/5/24 8:14 AM, Richard Sandiford wrote:

YunQiang Su  writes:

PR target/113179.

In `store_bit_field_using_insv`, we just use SUBREG if value_mode
>= op_mode, while in some ports a sign_extend will be needed,
such as MIPS64:
   If either GPR rs or GPR rt does not contain sign-extended 32-bit
   values (bits 63..31 equal), then the result of the operation is
   UNPREDICTABLE.

The problem happens for the code like:
   struct xx {
 int a:4;
 int b:24;
 int c:3;
 int d:1;
   };

   void xx (struct xx *a, long long b) {
 a->d = b;
   }

In the above code, the hard register containing `b` may not be
sign-extended.

gcc/
PR target/113179
* expmed.c(store_bit_field_using_insv): TRUNCATE value1 if
needed.

gcc/testsuite
PR target/113179
* gcc.target/mips/pr113179.c: New tests.
---
  gcc/expmed.cc| 12 +---
  gcc/testsuite/gcc.target/mips/pr113179.c | 18 ++
  2 files changed, 27 insertions(+), 3 deletions(-)
  create mode 100644 gcc/testsuite/gcc.target/mips/pr113179.c

diff --git a/gcc/expmed.cc b/gcc/expmed.cc
index 4ec035e4843..6a582593da8 100644
--- a/gcc/expmed.cc
+++ b/gcc/expmed.cc
@@ -704,9 +704,15 @@ store_bit_field_using_insv (const extraction_insn *insv, 
rtx op0,
}
  else
{
- tmp = gen_lowpart_if_possible (op_mode, value1);
- if (! tmp)
-   tmp = gen_lowpart (op_mode, force_reg (value_mode, value1));
+ if (targetm.mode_rep_extended (op_mode, value_mode))
+   tmp = simplify_gen_unary (TRUNCATE, op_mode,
+ value1, value_mode);
+ else
+   {
+ tmp = gen_lowpart_if_possible (op_mode, value1);
+ if (! tmp)
+   tmp = gen_lowpart (op_mode, force_reg (value_mode, value1));
+   }
}
  value1 = tmp;
}


I notice this patch is already applied.  Was it approved?  I didn't
see an approval in my feed or in the archives.
I don't see an approval and this patch was in my pending-patches box as 
unresolved.


Jeff


Re: [PATCH 0/2] fix RISC-V zcmp popretz [PR113715]

2024-06-05 Thread Jeff Law




On 6/5/24 1:47 AM, Fei Gao wrote:


On 2024-06-05 14:36  Kito Cheng  wrote:


Thanks for fixing this issue.  I am wondering, is it possible to
fix this without introducing a target hook?  I ask because...GCC 14
also has this bug, but I am not sure it's OK to introduce a new target
hook on a release branch.  Or would you suggest we just revert the patch to
fix that on GCC 14?


If a hook is unacceptable in GCC 14, I suggest reverting the following
commit on GCC 14:
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=b27d323a368033f0b37e93c57a57a35fd9997864

I started fixing this issue by adding changes in the machine-specific pass but
abandoned that for the following reasons:
1. more code is needed to detect the location of the epilogue in the whole insn list.
2. due to the impact of the scheduling pass, the clear-a0 and use-a0 insns get reordered, 
resulting in more
     code.
3. data flow analysis is needed, but insns don't have bb info any more, so 
rescan actually does
     nothing; I suspect there's a hidden issue in 
riscv_remove_unneeded_save_restore_calls
     using dfa.

So I came up with this hook-based patch in the prologue and epilogue pass to make the 
optimization happen as early as possible.  It ends up with simplicity and clear logic.
But let's back up and get a good explanation of what the problem is. 
Based on patch 2/2 it looks like we have lost an assignment to the 
return register.


To someone not familiar with this code, it sounds to me like we've made 
a mistake earlier and we're now defining a hook that lets us go back and 
fix that earlier mistake.   I'm probably wrong, but so far that's what 
it sounds like.


So let's get a good explanation of the problem and perhaps we'll find a 
better way to solve it.


jeff




Re: [PATCH 0/2] fix RISC-V zcmp popretz [PR113715]

2024-06-05 Thread Jeff Law




On 6/5/24 12:36 AM, Kito Cheng wrote:

Thanks for fixing this issue.  I am wondering, is it possible to
fix this without introducing a target hook?  I ask because...GCC 14
also has this bug, but I am not sure it's OK to introduce a new target
hook on a release branch.  Or would you suggest we just revert the patch to
fix that on GCC 14?
The question I would ask is why the target hook is needed; i.e., what 
problem needs to be solved and how does a new target hook help.




Jeff


Re: [PATCH-1v2] fwprop: Replace rtx_cost with insn_cost in try_fwprop_subst_pattern [PR113325]

2024-06-05 Thread Jeff Law




On 6/5/24 3:08 AM, Richard Sandiford wrote:

HAO CHEN GUI  writes:

Hi,
   This patch replaces rtx_cost with insn_cost in forward propagation.
In the PR, one constant vector should be propagated and replace a
pseudo in a store insn if we know it's a duplicated constant vector.
It reduces the insn cost but not rtx cost. In this case, the cost is
determined by destination operand (memory or pseudo). Unfortunately,
rtx cost can't help.

   The test case is added in the second target specific patch.
https://gcc.gnu.org/pipermail/gcc-patches/2024-January/643995.html

   Compared to the previous version, the main change is to not do the
substitution if either the new or old insn cost is zero.  A zero cost
means the cost is unknown.

  Previous version
https://gcc.gnu.org/pipermail/gcc-patches/2024-January/643994.html

   Bootstrapped and tested on x86 and powerpc64-linux BE and LE with no
regressions. Is it OK for the trunk?

ChangeLog
fwprop: Replace set_src_cost with insn_cost in try_fwprop_subst_pattern

gcc/
* fwprop.cc (try_fwprop_subst_pattern): Replace set_src_cost with
insn_cost.


Thanks for doing this.  It's definitely the right direction, but:


patch.diff
diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc
index cb6fd6700ca..184a22678b7 100644
--- a/gcc/fwprop.cc
+++ b/gcc/fwprop.cc
@@ -470,21 +470,19 @@ try_fwprop_subst_pattern (obstack_watermark , 
insn_change _change,
redo_changes (0);
  }

-  /* ??? In theory, it should be better to use insn costs rather than
- set_src_costs here.  That would involve replacing this code with
- change_is_worthwhile.  */


...as hinted at in the comment, rtl-ssa already has a routine for
insn_cost-based calculations.  It has two (supposed) advantages:
it caches the old costs, and it takes execution frequency into
account when optimising for speed.

The comment is out of date though.  The name of the routine is
changes_are_worthwhile rather than change_is_worthwhile.  Could you
try using that instead?
Funny, I went wandering around looking for that function to see how it 
was implemented and how it might compare to what was being proposed.


Of course I never found it, even after rewinding to various old git 
hashes that looked promising.


So, yea, definitely would prefer re-using changes_are_worthwhile if it 
works reasonably well for the issue at hand.


jeff



Re: [PATCH v4] RISC-V: Introduce -mvector-strict-align.

2024-06-04 Thread Jeff Law




On 5/28/24 1:19 PM, Robin Dapp wrote:

Hi,

this patch disables movmisalign by default and introduces
the -mno-vector-strict-align option to override it and re-enable
movmisalign.  For now, generic-ooo is the only uarch that supports
misaligned vector access.

The patch also adds a check_effective_target_riscv_v_misalign_ok to
the testsuite which enables or disables the vector misalignment tests
depending on whether the target under test can execute a misaligned
vle32.

Changes from v3:
  - Adressed Kito's comments.
  - Made -mscalar-strict-align a real alias.

Regards
  Robin

gcc/ChangeLog:

* config/riscv/riscv-opts.h (TARGET_VECTOR_MISALIGN_SUPPORTED):
Move from here...
* config/riscv/riscv.h (TARGET_VECTOR_MISALIGN_SUPPORTED):
...to here and map to riscv_vector_unaligned_access_p.
* config/riscv/riscv.opt: Add -mvector-strict-align.
* config/riscv/riscv.cc (struct riscv_tune_param): Add
vector_unaligned_access.
(riscv_override_options_internal): Set
riscv_vector_unaligned_access_p.
* doc/invoke.texi: Document -mvector-strict-align.

gcc/testsuite/ChangeLog:

* lib/target-supports.exp: Add
check_effective_target_riscv_v_misalign_ok.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul2-7.c: Add
-mno-vector-strict-align.
* gcc.dg/vect/costmodel/riscv/rvv/vla_vs_vls-10.c: Ditto.
* gcc.dg/vect/costmodel/riscv/rvv/vla_vs_vls-11.c: Ditto.
* gcc.dg/vect/costmodel/riscv/rvv/vla_vs_vls-12.c: Ditto.
* gcc.dg/vect/costmodel/riscv/rvv/vla_vs_vls-8.c: Ditto.
* gcc.dg/vect/costmodel/riscv/rvv/vla_vs_vls-9.c: Ditto.
* gcc.target/riscv/rvv/autovec/vls/misalign-1.c: Ditto.
So per the patchwork discussion this morning let's go ahead with this, 
knowing we may need to revisit for:


1. Coordination with LLVM on option naming/behavior.  LLVM will have a 
release before gcc-15, so if at all possible we should follow their lead 
on option naming.


2. Adjusting defaults once kernel unaligned trap handlers are in place.

Palmer is going to reach out to David on his team to try and push 
things towards using generic-ooo tuning for Fedora on RISC-V.  I'll do 
the same with Ventana's contacts at Canonical (Heinrich & Gordon).


I expect we're better aligned with Fedora on this topic -- Fedora feeds 
RHEL which isn't likely to care about SBCs, so cores that Fedora is 
going to be the most interested in over time are much more likely to 
handle unaligned vector loads/stores in hardware.  So the path we want 
lines up with Fedora quite well, IMHO.


Canonical seems to be more interested in supporting these SBCs, so they 
may have a harder time with a default of generic-ooo since it'll either 
result in binaries that don't work (today) or that have poor performance 
(in the future, with kernel unaligned trap handlers updated).


Jeff


Re: [PATCH] Don't simplify NAN/INF or out-of-range constant for FIX/UNSIGNED_FIX.

2024-06-04 Thread Jeff Law




On 5/26/24 7:08 PM, liuhongt wrote:

Update in V2:
Guard constant folding for overflow value in
fold_convert_const_int_from_real with flag_trapping_math.
Add -fno-trapping-math to related testcases which warn for overflow
in conversion from floating point to integer.

Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?

According to IEEE standard, for conversions from floating point to
integer. When a NaN or infinite operand cannot be represented in the
destination format and this cannot otherwise be indicated, the invalid
operation exception shall be signaled. When a numeric operand would
convert to an integer outside the range of the destination format, the
invalid operation exception shall be signaled if this situation cannot
otherwise be indicated.

The patch prevents simplification of the conversion from floating point
to integer for NAN/INF/out-of-range constants when flag_trapping_math.

gcc/ChangeLog:

PR rtl-optimization/100927
PR rtl-optimization/115161
PR rtl-optimization/115115
* simplify-rtx.cc
(simplify_context::simplify_binary_operation_1):
Prevent simplification of FIX/UNSIGNED_FIX for NAN/INF/out-of-range
constants when flag_trapping_math.
* fold-const.cc (fold_convert_const_int_from_real): Don't fold
for overflow values when flag_trapping_math.

gcc/testsuite/ChangeLog:

* gcc.dg/pr100927.c: New test.
* c-c++-common/Wconversion-1.c: Add -fno-trapping-math.
* c-c++-common/dfp/convert-int-saturate.c: Ditto.
* g++.dg/ubsan/pr63956.C: Ditto.
* g++.dg/warn/Wconversion-real-integer.C: Ditto.
* gcc.c-torture/execute/20031003-1.c: Ditto.
* gcc.dg/Wconversion-complex-c99.c: Ditto.
* gcc.dg/Wconversion-real-integer.c: Ditto.
* gcc.dg/c90-const-expr-11.c: Ditto.
* gcc.dg/overflow-warn-8.c: Ditto.

OK.  Thanks.

jeff




Re: [PATCH-1] fwprop: Replace rtx_cost with insn_cost in try_fwprop_subst_pattern [PR113325]

2024-06-04 Thread Jeff Law




On 1/25/24 6:16 PM, HAO CHEN GUI wrote:

Hi,
   This patch replaces rtx_cost with insn_cost in forward propagation.
In the PR, one constant vector should be propagated and replace a
pseudo in a store insn if we know it's a duplicated constant vector.
It reduces the insn cost but not rtx cost. In this case, the kind of
destination operand (memory or pseudo) decides the cost and rtx cost
can't reflect it.

   The test case is added in the second target specific patch.

   Bootstrapped and tested on x86 and powerpc64-linux BE and LE with no
regressions. Is it OK for next stage 1?

Thanks
Gui Haochen


ChangeLog
fwprop: Replace rtx_cost with insn_cost in try_fwprop_subst_pattern

gcc/
PR target/113325
* fwprop.cc (try_fwprop_subst_pattern): Replace rtx_cost with
insn_cost.

Testcase?  I don't care if it's ppc specific.

I think we generally want to move from rtx_cost to insn_cost, so I think 
the change itself is fine.  We just want to make sure a test covers the 
change in some manner.


Also note this a change to generic code and could likely trigger 
failures on various targets that have assembler scanning tests.  So once 
you've got a testcase and the full patch is ack'd we'll need to watch 
closely for regressions reported on other targets.



So ACK'd once you add a testcase.

Jeff


Re: [PATCH 1/2] Simplify (AND (ASHIFTRT A imm) mask) to (LSHIFTRT A imm) for vector mode.

2024-06-04 Thread Jeff Law




On 5/23/24 8:25 PM, Hongtao Liu wrote:

CC for review.

On Tue, May 21, 2024 at 1:12 PM liuhongt  wrote:


When the mask is ((1 << (prec - imm)) - 1), which is used to clear the
upper bits of A, then it can be simplified to LSHIFTRT.

i.e. Simplify
(and:v8hi
   (ashiftrt:v8hi A 8)
   (const_vector 0xff x8))
to
(lshiftrt:v8hi A 8)
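For the 16-bit element case above (imm = 8, prec = 16, mask = (1 << 8) - 1 = 0xff), the scalar equivalence behind the transform can be sanity-checked in plain C:

```c
#include <stdint.h>

/* Arithmetic shift right, then a mask that clears the bits produced
   by sign extension...  */
static uint16_t
and_ashiftrt (int16_t x)
{
  return (uint16_t) (x >> 8) & 0xff;
}

/* ...is the same as a plain logical shift right of the element.  */
static uint16_t
lshiftrt (uint16_t x)
{
  return x >> 8;
}
```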

Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?

gcc/ChangeLog:

 PR target/114428
 * simplify-rtx.cc
 (simplify_context::simplify_binary_operation_1):
 Simplify (AND (ASHIFTRT A imm) mask) to (LSHIFTRT A imm) for
 specific mask.


Can you add a testcase for this?  I don't mind if it's x86 specific and 
does a bit of asm scanning.


Also note that the context for this patch has changed, so it won't 
automatically apply.  So be extra careful when updating so that it goes 
into the right place (all the more reason to have a testcase validating 
that the optimization works correctly).



I think the patch itself is fine.  So further review is just for the 
testcase and should be easy.


jeff

ps.  It seems to help RISC-V as well :-)




Re: [PATCH] Add config file so b4 uses inbox.sourceware.org automatically

2024-06-03 Thread Jeff Law




On 5/23/24 9:49 AM, Jonathan Wakely wrote:

It looks like my patch[1] to make b4 figure this out automagically won't
be accepted, so this makes it work for GCC. A similar commit could be
done for each project hosted on sourceware.org if desired.

[1] 
https://lore.kernel.org/tools/20240523143752.385810-1-jwak...@redhat.com/T/#u

OK for trunk?

-- >8 --

This makes b4 use inbox.sourceware.org instead of the default host
lore.kernel.org, so that every b4 user doesn't have to configure this
themselves.

ChangeLog:

* .b4-config: New file.
Given the impact on anyone not using b4 is nil and it makes life easier 
for those using b4, this seems trivially OK for the trunk.


jeff



Re: [PATCH v2] RISC-V: Add Zfbfmin extension

2024-06-03 Thread Jeff Law




On 6/1/24 1:45 AM, Xiao Zeng wrote:

1 In the previous patch, the libcall for BF16 was implemented:


2 Riscv provides Zfbfmin extension, which completes the "Scalar BF16 Converts":


3 Implemented replacing libcall with Zfbfmin extension instruction.

4 Reused previous testcases in:

gcc/ChangeLog:

* config/riscv/iterators.md: Add mode_iterator between
floating-point modes and BFmode.
* config/riscv/riscv.cc (riscv_output_move): Handle BFmode move
for zfbfmin.
* config/riscv/riscv.md (truncbf2): New pattern for BFmode.
(extendbfsf2): Ditto.
(*movhf_hardfloat): Add BFmode.
(*mov_hardfloat): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/zfbfmin-bf16_arithmetic.c: New test.
* gcc.target/riscv/zfbfmin-bf16_comparison.c: New test.
* gcc.target/riscv/zfbfmin-bf16_float_libcall_convert.c: New test.
* gcc.target/riscv/zfbfmin-bf16_integer_libcall_convert.c: New test.

OK for the trunk.  Thanks!

jeff



Re: [PATCH] RISC-V: Add min/max patterns for ifcvt.

2024-06-03 Thread Jeff Law




On 6/3/24 11:03 AM, Palmer Dabbelt wrote:



+;; Provide a minmax pattern for ifcvt to match.
+(define_insn "*_cmp_3"
+  [(set (match_operand:X 0 "register_operand" "=r")
+    (if_then_else:X
+    (bitmanip_minmax_cmp_op
+    (match_operand:X 1 "register_operand" "r")
+    (match_operand:X 2 "register_operand" "r"))
+    (match_dup 1)
+    (match_dup 2)))]
+  "TARGET_ZBB"
+  "\t%0,%1,%z2"
+  [(set_attr "type" "")])


This is a bit different than how we're handling the other min/max type 
attributes


    (define_insn "*3"
  [(set (match_operand:X 0 "register_operand" "=r")
    (bitmanip_minmax:X (match_operand:X 1 "register_operand" "r")
   (match_operand:X 2 "reg_or_0_operand" "rJ")))]
  "TARGET_ZBB"
  "\t%0,%1,%z2"
  [(set_attr "type" "")])

but it looks like it ends up with the same types after all the iterators 
(there's some "max vs smax" and "smax vs maxs" juggling, but IIUC it 
ends up in the same place).  I think it'd be clunkier to try and use all 
the same iterators, though, so


Reviewed-by: Palmer Dabbelt 

[I was wondering if we need the reversed, Jeff on the call says we 
don't.  I couldn't figure out how to write a test for it.]
Right.  I had managed to convince myself that we didn't need the 
reversed case.  I'm less sure now than I was earlier, but I'm also 
confident that if the need arises we can trivially handle it.  At some 
point there's canonicalization of the condition and that's almost 
certainly what's making it hard to synthesize a testcase for the 
reversed pattern.


The other thing I pondered was whether or not we should support SImode 
min/max on rv64.  It was critical for simplifying that abs2 routine in 
x264, but I couldn't convince myself it was needed here.  So I just set 
it aside and didn't mention it.


jeff


Re: [PATCH] check_GNU_style: Use raw strings.

2024-06-03 Thread Jeff Law




On 5/31/24 1:38 PM, Robin Dapp wrote:

Hi,

this silences some warnings when using check_GNU_style.

I didn't expect this to have any bootstrap or regtest impact
but I still ran it on x86 - no change.

Regards
  Robin

contrib/ChangeLog:

* check_GNU_style_lib.py: Use raw strings for regexps.

OK
jeff



Re: RISC-V: Fix round_32.c test on RV32

2024-05-31 Thread Jeff Law




On 5/27/24 4:17 PM, Jivan Hakobyan wrote:

Ya, makes sense -- I guess the current values aren't that exciting for
execution, but we could just add some more interesting ones...


During the development of the patch, I had an issue with large
numbers (2e34, -2e34).  They are used in the
gfortran.fortran-torture/execute/intrinsic_aint_anint.f90 test.
Besides that, a benchmark from SPEC 2017 also failed (I can't
remember which one).  We don't have an issue with them now.  Of
course, I can add additional tests with large numbers, but it would
be a double-check (Fortran's test covers them first).

So i think the question is what do we want to do in the immediate term.

We can remove the test to get cleaner testresults on rv32.  I'm not a 
big fan of removing tests, but this test just doesn't make sense on rv32 
as-is.



We could leave things alone for now on the assumption the test will be 
rewritten to check for calls to the proper routines and possibly 
extended to include runtime verification.


I tend to lean towards the first.  That obviously wouldn't close the 
door on re-adding the test later with runtime verification and such.


Palmer, do you have a strong opinion either way?

jeff


Re: [RFC/RFA] [PATCH 02/12] Add built-ins and tests for bit-forward and bit-reversed CRCs

2024-05-31 Thread Jeff Law




On 5/28/24 12:44 AM, Richard Biener wrote:

On Mon, May 27, 2024 at 5:16 PM Jeff Law  wrote:




On 5/27/24 12:38 AM, Richard Biener wrote:

On Fri, May 24, 2024 at 10:44 AM Mariam Arutunian
 wrote:


This patch introduces new built-in functions to GCC for computing bit-forward 
and bit-reversed CRCs.
These builtins aim to provide efficient CRC calculation capabilities.
When the target architecture supports CRC operations (as indicated by the 
presence of a CRC optab),
the builtins will utilize the expander to generate CRC code.
In the absence of hardware support, the builtins default to generating code for 
a table-based CRC calculation.


I wonder whether for embedded target use we should arrange for the
table-based CRC calculation to be out-of-line and implemented in a
way so uses across TUs can be merged?  I guess a generic
implementation inside libgcc is difficult?

I think the difficulty is the table is dependent upon the polynomial.
So we'd have to arrange to generate, then pass in the table.

In theory we could have the linker fold away duplicate tables as those
should be in read only sections without relocations to internal members.
   So much like we do for constant pool entries.  Though this hasn't been
tested.

The CRC implementation itself could be subject to ICF if it's off in its
own function.  If it's inlined (and that's a real possibility), then
there's little hope of ICF helping on the codesize.


I was wondering about doing some "standard" mangling in the implementation
namespace and using comdat groups for both code and data?
But I'm not sure how that really solves anything given the dependencies 
on the polynomial.  ie, the contents of the table varies based on that 
polynomial and the polynomial can (and will) differ across CRC 
implementations.








Or we could just not do any of this for -Os/-Oz if the target doesn't
have a carryless multiply or crc with the appropriate polynomial.  Given
the CRC table is probably larger than all the code in a bitwise
implementation, disabling for -Os/-Oz seems like a very reasonable choice.


I was mainly thinking about the case where the user uses the new builtins,
but yes, when optimizing for size we can disable the recognition of open-coded
variants.

Turns out Mariam's patch already disables this for -Os.  :-)

For someone directly using the builtin, they're going to have to pass 
the polynomial as a constant to the builtin, with the possible exception 
of when the target has a crc instruction where the polynomial is defined 
by the hardware.
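For reference, the table-based scheme under discussion looks roughly like this (a sketch using an illustrative CRC-8 polynomial, 0x07; the table contents depend entirely on that polynomial, which is exactly what makes cross-TU sharing of the table hard):

```c
#include <stdint.h>
#include <stddef.h>

static uint8_t crc8_table[256];

/* Fill the 256-entry table for POLY; a different polynomial yields a
   completely different table.  */
static void
crc8_init (uint8_t poly)
{
  for (int i = 0; i < 256; i++)
    {
      uint8_t c = (uint8_t) i;
      for (int b = 0; b < 8; b++)
	c = (c & 0x80) ? (uint8_t) ((c << 1) ^ poly) : (uint8_t) (c << 1);
      crc8_table[i] = c;
    }
}

/* Table-driven form: one lookup per input byte.  */
static uint8_t
crc8_table_driven (const uint8_t *p, size_t n)
{
  uint8_t crc = 0;
  while (n--)
    crc = crc8_table[crc ^ *p++];
  return crc;
}

/* Bit-at-a-time polynomial long division, the "naive" form.  */
static uint8_t
crc8_bitwise (const uint8_t *p, size_t n, uint8_t poly)
{
  uint8_t crc = 0;
  while (n--)
    {
      crc ^= *p++;
      for (int b = 0; b < 8; b++)
	crc = (crc & 0x80) ? (uint8_t) ((crc << 1) ^ poly)
			   : (uint8_t) (crc << 1);
    }
  return crc;
}
```

Both forms compute the same CRC; the table variant trades the 256-byte table for the inner bit loop, which is why it loses its appeal at -Os/-Oz.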


Jeff


Re: [PATCH 5/5][v3] RISC-V: Avoid inserting after a GIMPLE_COND with SLP and early break

2024-05-31 Thread Jeff Law




On 5/31/24 7:44 AM, Richard Biener wrote:

When vectorizing an early break loop with LENs (do we miss some
check here to disallow this?) we can end up deciding to insert
stmts after a GIMPLE_COND when doing SLP scheduling and trying
to be conservative with placing of stmts only dependent on
the implicit loop mask/len.  The following avoids this, I guess
it's not perfect but it does the job fixing some observed
RISC-V regression.

* tree-vect-slp.cc (vect_schedule_slp_node): For mask/len
loops make sure to not advance the insertion iterator
beyond a GIMPLE_COND.
Note this patch may depend on others in the series.  I don't think the 
pre-commit CI tester is particularly good at handling that, particularly 
if the other patches in the series don't have the tagging for the 
pre-commit CI.


What most likely happened is this patch and only this patch was applied 
against the baseline for testing.


There are (manual) ways to get things re-tested.  I'm hoping Patrick and 
Edwin automate that procedure relatively soon.  Until that happens you 
have to email patchworks...@rivosinc.com with a URL for the patch in 
patchwork that you want retested.




Jeff



Re: [PING] [PATCH] RISC-V: Add Zfbfmin extension

2024-05-31 Thread Jeff Law




On 5/30/24 5:38 AM, Xiao Zeng wrote:

1 In the previous patch, the libcall for BF16 was implemented:


2 Riscv provides Zfbfmin extension, which completes the "Scalar BF16 Converts":


3 Implemented replacing libcall with Zfbfmin extension instruction.

4 Reused previous testcases in:


gcc/ChangeLog:

* config/riscv/riscv.cc (riscv_output_move): Handle BFmode move
for zfbfmin.
* config/riscv/riscv.md (truncsfbf2): New pattern for BFmode.
(trunchfbf2): Dotto.
(truncdfbf2): Dotto.
(trunctfbf2): Dotto.
(extendbfsf2): Dotto.
(*movhf_hardfloat): Add BFmode.
(*mov_hardfloat): Dotto.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/zfbfmin-bf16_arithmetic.c: New test.
* gcc.target/riscv/zfbfmin-bf16_comparison.c: New test.
* gcc.target/riscv/zfbfmin-bf16_float_libcall_convert.c: New test.
* gcc.target/riscv/zfbfmin-bf16_integer_libcall_convert.c: New test.
---









+
+;; The conversion of HF/DF/TF to BF needs to be done with SF if there is a
+;; chance to generate at least one instruction, otherwise just using
+;; libfunc __trunc[h|d|t]fbf2.
+(define_expand "trunchfbf2"
+  [(set (match_operand:BF 0 "register_operand" "=f")
+   (float_truncate:BF
+  (match_operand:HF 1 "register_operand" " f")))]
+  "TARGET_ZFBFMIN"
+  {
+convert_move (operands[0],
+ convert_modes (SFmode, HFmode, operands[1], 0), 0);
+DONE;
+  }
+  [(set_attr "type" "fcvt")
+   (set_attr "mode" "BF")])

I would suggest using a mode iterator to avoid explicit pattern duplication.

Essentially a mode iterator allows you to specify that the pattern 
should be repeated over a series of modes.


It looks like you've defined a conversion from HF, DF, TF.  So you define 
an iterator that includes just those modes.  You would use the mode 
iterator rather than HF, DF or TF in your pattern.


That just fixes the mode in the pattern.  You also need to have the name 
automagically adjust as well.  Use <mode> in the name, so the name 
would be something like trunc<mode>bf2.


When you want to reference the mode in code you can do something like
E_<MODE>mode

And that will map down to HFmode, DFmode, TFmode appropriately.

I suspect you can do something similar for the extension patterns.

In fact, it looks like you did this for the *mov<mode>_hardfloat pattern.
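Concretely, the suggestion amounts to something like the following machine-description sketch (the iterator name HDT is made up here; pick whatever fits the port's conventions):

```
;; Iterate over the three source modes for the BF truncations.
(define_mode_iterator HDT [HF DF TF])

;; One expander replaces trunchfbf2, truncdfbf2 and trunctfbf2.
(define_expand "trunc<mode>bf2"
  [(set (match_operand:BF 0 "register_operand" "=f")
	(float_truncate:BF
	  (match_operand:HDT 1 "register_operand" " f")))]
  "TARGET_ZFBFMIN"
  {
    convert_move (operands[0],
		  convert_modes (SFmode, <MODE>mode, operands[1], 0), 0);
    DONE;
  })
```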

Jeff


Re: [PATCH] ifcvt: Clarify if_info.original_cost.

2024-05-31 Thread Jeff Law




On 5/31/24 9:03 AM, Robin Dapp wrote:

Hi,

before noce_find_if_block processes a block it sets up an if_info
structure that holds the original costs.  At that point the costs of
the then/else blocks have not been added so we only care about the
"if" cost.

The code originally used BRANCH_COST for that but was then changed
to COST_N_INSNS (2) - a compare and a jump.
This patch computes the jump costs via
   insn_cost (if_info.jump, ...)
which is supposed to incorporate the branch costs and, in case of a CC
comparison,
   pattern_cost (if_info.cond, ...)
which is supposed to account for the CC creation.

For compare_and_jump patterns insn_cost should have already computed
the right cost.

Does this "split" make sense, generally?

Bootstrapped and regtested on x86, aarch64 and power10.  Regtested
on riscv.

Regards
  Robin

gcc/ChangeLog:

* ifcvt.cc (noce_process_if_block): Subtract condition pattern
cost if applicable.
(noce_find_if_block): Use insn_cost and pattern_cost for
original cost.
OK.  Obviously we'll need to be on the lookout for regressions.  My bet 
is on s390 since you already tested the x86, aarch64 & p10 targets :-)



jeff



Re: [PATCH] RISC-V: Add min/max patterns for ifcvt.

2024-05-31 Thread Jeff Law




On 5/31/24 9:07 AM, Robin Dapp wrote:

Hi,

ifcvt likes to emit

(set (reg 0)
     (if_then_else
       (ge (reg 1) (reg 2))
       (reg 1)
       (reg 2)))

which can be recognized as min/max patterns in the backend.
This patch adds such patterns and the respective iterators as well as a
test.
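In source terms, the shape being matched is just a conditional select of the compared operands, e.g.:

```c
/* With the new pattern, ifcvt can turn this branchy select into a
   single Zbb max instruction instead of a compare-and-branch.  */
long
smax (long a, long b)
{
  return a >= b ? a : b;
}
```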

This depends on the generic ifcvt change.
Regtested on rv64gcv_zvfh_zicond_zbb_zvbb.

Regards
  Robin

gcc/ChangeLog:

* config/riscv/bitmanip.md (*_cmp_3):
New min/max ifcvt pattern.
* config/riscv/iterators.md (minu): New iterator.
* config/riscv/riscv.cc (riscv_noce_conversion_profitable_p):
Remove riscv-specific adjustment.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/zbb-min-max-04.c: New test.

OK once prereqs are ACK'd.


Presumably this fixes that deepsjeng missed if-conversion we kicked 
around a couple months back?



jeff


Re: Reverted recent patches to resource.cc

2024-05-30 Thread Jeff Law




On 5/30/24 8:09 PM, Hans-Peter Nilsson wrote:

Date: Wed, 29 May 2024 21:23:58 -0600
Cc: gcc-patches@gcc.gnu.org



I don't bother with qemu.exp at all.  I've set up binfmt handlers so
that I can execute foreign binaries.

So given a root filesystem, I can chroot into it and do whatever I need.
   As far as dejagnu is concerned it looks like the native system.


Interesting.  In a Docker setup (or similar container)?
I've got docker containers for some of these.  m68k, alpha, s390 for 
example.   They're based off debian base images.  I don't have anything 
for sparc and I don't see a base image to build from.  If there was a 
base image, then a Dockerfile like this would get you started:



FROM sparc64/debian:latest

# Basic environment variables so we can apt-get without interactive prompts
ENV LC_ALL C
ENV DEBIAN_FRONTEND noninteractive
ENV TZ=Etc/UTC

RUN apt-get update && apt-get -y install gcc dejagnu binutils \
    default-jre-headless git build-essential autoconf bison flex gawk \
    make texinfo help2man libncurses5-dev python3-dev python3-distutils \
    libtool libtool-bin unzip wget curl rsync g++ libmpc-dev libgmp-dev \
    libmpfr-dev python3 libisl-dev vim automake autotools-dev sudo \
    bzip2 xz-utils gdb bc libssl-dev libelf-dev


With the lack of an existing docker image, you can probably use 
debootstrap to construct an initial chroot, then import that into docker, 
adding whatever bits you need to do the build (ie, compilers, dejagnu, 
make, texinfo, blah blah blah) via apt until it works.  It's been a 
while since I've done it, but I'm pretty sure that's how I got things 
going on the sh4 and sh4eb platforms.



The JRE bits are only needed because these get launched as a docker 
swarm service and thus need to connect to the Jenkins server using JNLP. 
 Some of the packages are only needed for kernel builds or glibc builds.



Jeff




[to-be-committed] [RISC-V] Use Zbkb for general 64 bit constants when profitable

2024-05-30 Thread Jeff Law
Basically this adds the ability to generate two independent constants 
during  synthesis, then bring them together with a pack instruction. 
Thus we never need to go out to the constant pool when zbkb is enabled. 
The worst sequence we ever generate is


lui+addi+lui+addi+pack

Obviously if either half can be synthesized with just a lui or just an 
addi, then we'll DTRT automagically.   So for example:


unsigned long foo_0x1425000028000000(void) {
return 0x1425000028000000;
}

The high and low halves are just a lui.  So the final synthesis is:



	li	a5,671088640	# 15	[c=4 l=4]  *movdi_64bit/1
	li	a0,337969152	# 16	[c=4 l=4]  *movdi_64bit/1
	pack	a0,a5,a0	# 17	[c=12 l=4]  riscv_xpack_di_si_2
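The pack semantics assumed here (per the Zbkb spec: rs1 supplies the low half of the result, rs2 the high half) can be modeled for RV64 as follows, using the two li values above (671088640 = 0x28000000, 337969152 = 0x14250000):

```c
#include <stdint.h>

/* Model of RV64 Zbkb pack: rd = rs2[31:0] ## rs1[31:0].  */
static uint64_t
pack64 (uint64_t rs1, uint64_t rs2)
{
  return ((rs2 & 0xffffffffULL) << 32) | (rs1 & 0xffffffffULL);
}
```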


On the implementation side, I think the bits I've put in here likely can 
be used to handle the repeating constant case for !zbkb.  I think it 
likely could be used to help capture cases where the upper half can be 
derived from the lower half (say by turning a bit on or off, shifting or 
something similar).  The key in both of these cases is we need a 
temporary register holding an intermediate value.


Ventana's internal tester enables zbkb, but I don't think any of the 
other testers currently exercise zbkb.  We'll probably want to change 
that at some point, but I don't think it's super-critical yet.


While I can envision a few more cases where we could improve constant 
synthesis,   No immediate plans to work in this space, but if someone is 
interested, some thoughts are recorded here:




https://wiki.riseproject.dev/display/HOME/CT_00_031+--+Additional+Constant+Synthesis+Improvements




Jeffgcc/
* config/riscv/riscv.cc (riscv_integer_op): Add new field.
(riscv_build_integer_1): Initialize the new field.
(riscv_built_integer): Recognize more cases where Zbkb's
pack instruction is profitable.
(riscv_move_integer): Loop over all the codes.  If requested,
save the current constant into a temporary.  Generate pack
for more cases using the saved constant.


diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index 91fefacee80..10af38a5a81 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -250,6 +250,7 @@ struct riscv_arg_info {
and each VALUE[i] is a constant integer.  CODE[0] is undefined.  */
 struct riscv_integer_op {
   bool use_uw;
+  bool save_temporary;
   enum rtx_code code;
   unsigned HOST_WIDE_INT value;
 };
@@ -759,6 +760,7 @@ riscv_build_integer_1 (struct riscv_integer_op 
codes[RISCV_MAX_INTEGER_OPS],
   codes[0].code = UNKNOWN;
   codes[0].value = value;
   codes[0].use_uw = false;
+  codes[0].save_temporary = false;
   return 1;
 }
   if (TARGET_ZBS && SINGLE_BIT_MASK_OPERAND (value))
@@ -767,6 +769,7 @@ riscv_build_integer_1 (struct riscv_integer_op 
codes[RISCV_MAX_INTEGER_OPS],
   codes[0].code = UNKNOWN;
   codes[0].value = value;
   codes[0].use_uw = false;
+  codes[0].save_temporary = false;
 
   /* RISC-V sign-extends all 32bit values that live in a 32bit
 register.  To avoid paradoxes, we thus need to use the
@@ -796,6 +799,7 @@ riscv_build_integer_1 (struct riscv_integer_op 
codes[RISCV_MAX_INTEGER_OPS],
  alt_codes[alt_cost-1].code = PLUS;
  alt_codes[alt_cost-1].value = low_part;
  alt_codes[alt_cost-1].use_uw = false;
+ alt_codes[alt_cost-1].save_temporary = false;
  memcpy (codes, alt_codes, sizeof (alt_codes));
  cost = alt_cost;
}
@@ -810,6 +814,7 @@ riscv_build_integer_1 (struct riscv_integer_op 
codes[RISCV_MAX_INTEGER_OPS],
  alt_codes[alt_cost-1].code = XOR;
  alt_codes[alt_cost-1].value = low_part;
  alt_codes[alt_cost-1].use_uw = false;
+ alt_codes[alt_cost-1].save_temporary = false;
  memcpy (codes, alt_codes, sizeof (alt_codes));
  cost = alt_cost;
}
@@ -852,6 +857,7 @@ riscv_build_integer_1 (struct riscv_integer_op 
codes[RISCV_MAX_INTEGER_OPS],
  alt_codes[alt_cost-1].code = ASHIFT;
  alt_codes[alt_cost-1].value = shift;
  alt_codes[alt_cost-1].use_uw = use_uw;
+ alt_codes[alt_cost-1].save_temporary = false;
  memcpy (codes, alt_codes, sizeof (alt_codes));
  cost = alt_cost;
}
@@ -873,9 +879,11 @@ riscv_build_integer_1 (struct riscv_integer_op 
codes[RISCV_MAX_INTEGER_OPS],
  codes[0].value = (((unsigned HOST_WIDE_INT) value >> trailing_ones)
| (value << (64 - trailing_ones)));
  codes[0].use_uw = false;
+ codes[0].save_temporary = false;
  codes[1].code = ROTATERT;
  codes[1].value = 64 - trailing_ones;
  codes[1].use_uw = false;
+ codes[1].save_temporary = false;
  cost = 2;
}
   /* Handle the case where the 11 bit range of zero bits wraps around.  */
@@ -888,9 +896,11 @@ 

Re: Reverted recent patches to resource.cc

2024-05-29 Thread Jeff Law




On 5/29/24 8:41 PM, Hans-Peter Nilsson wrote:




I do bootstraps and regression testsuite runs on a variety of systems
via qemu (alpha, m68k, aarch64, s390, ppc64, etc).  It ain't fast, but
it does work if QEMU is in pretty good shape and you can find a root
filesystem to use.


That might certainly fit the bill.  I guess you mean with a
filesystem image for e.g. sparc-linux?

I keep postponing looking into getting a working setup
(mostly the baseboard file) for qemu-anything + newlib.
Last I looked, qemu.exp had a serious typo...but I see that
was just for arm-eabi and arm-pi4, so yes, that might be a
viable path, thanks for the reminder.
I don't bother with qemu.exp at all.  I've set up binfmt handlers so 
that I can execute foreign binaries.


So given a root filesystem, I can chroot into it and do whatever I need. 
 As far as dejagnu is concerned it looks like the native system.



Jeff



Re: Reverted recent patches to resource.cc

2024-05-29 Thread Jeff Law




On 5/29/24 7:28 PM, Hans-Peter Nilsson wrote:

From: Hans-Peter Nilsson 
Date: Mon, 27 May 2024 19:51:47 +0200



2: Does not depend on 1, but corrects an incidentally found wart:
find_basic_block calls fails too often.  Replace it with "modern"
insn-to-basic-block cross-referencing.

3: Just an addendum to 2: removes an "if", where the condition is now
always-true, dominated by a gcc_assert, and where the change in
indentation was too ugly.

4: Corrects another incidentally found wart: for the last 15 years the
code in resource.cc has only been called from within reorg.cc (and
reorg.c), specifically not possibly before calling init_resource_info
or after free_resource_info, so we can discard the code that tests
certain allocated arrays for NULL.  I didn't even bother with a
gcc_assert; besides some gen*-generated files, only reorg.cc includes
resource.h (not to be confused with the system sys/resource.h).
A grep says the #include resource.h can be removed from those gen*
files and presumably from RESOURCE_H(!) as well.  Some Other Time.
Also, removed a redundant "if (tinfo != NULL)" and moved the then-code
into the previous then-clause.

   resource.cc: Replace calls to find_basic_block with cfgrtl
 BLOCK_FOR_INSN
   resource.cc (mark_target_live_regs): Remove check for bb not found
   resource.cc: Remove redundant conditionals


I had to revert those last three patches due to PR
bootstrap/115284.  I hope to revisit once I have a means to
reproduce (and fix) the underlying bug.  It doesn't have to
be a bug with those changes per-se: IMHO the "improved"
lifetimes could just as well have uncovered a bug elsewhere
in reorg.  It's still on me to resolve that situation; done.
I'm just glad the cause was the incidental improvements and
not the original bug I wanted to fix.

There appears to be only a single supported SPARC machine in
cfarm: cfarm216, and I currently can't reach it due to what
appears to be issues at my end.  I guess I'll either fix
that or breathe life into sparc-elf+sim.

Or if you've got a reasonable server to use, QEMU might save you :-)

I do bootstraps and regression testsuite runs on a variety of systems 
via qemu (alpha, m68k, aarch64, s390, ppc64, etc).  It ain't fast, but 
it does work if QEMU is in pretty good shape and you can find a root 
filesystem to use.


jeff


Re: [RFC/RFA] [PATCH 08/12] Add a new pass for naive CRC loops detection

2024-05-29 Thread Jeff Law




On 5/28/24 1:01 AM, Richard Biener wrote:

On Fri, May 24, 2024 at 10:46 AM Mariam Arutunian
 wrote:


This patch adds a new compiler pass aimed at identifying naive CRC 
implementations,
characterized by the presence of a loop calculating a CRC (polynomial long 
division).
Upon detection of a potential CRC, the pass prints an informational message.

Performs the CRC optimization if the optimization level is >= 2,
unless optimizing for size or -fno-gimple-crc-optimization is given.

This pass is added for the detection and optimization of naive CRC 
implementations,
improving the efficiency of CRC-related computations.

This patch includes only initial fast checks for filtering out non-CRCs,
detected possible CRCs verification and optimization parts will be provided in 
subsequent patches.


Just a few quick questions - I'm waiting for a revision with Jeffs 
comments cleared before having a closer look.  The patch does

nothing but analyze right now, correct?  I assume a later patch will
fill in stuff in ::execute and use the return value of
loop_may_calculate_crc (it's a bit odd to review such a "split"
thing).
We split it up into functional chunks.  I think if it gets approved it 
probably should go in atomically since it makes no sense to commit the 
first pass recognition filter without the validation step or the 
validation step without the codegen step.


So consider the break down strictly for review convenience.




I think what this does fits final value replacement which lives in 
tree-scalar-evolution.cc and works from the loop-closed PHIs, trying

to replace those.  I'm not sure we want to have a separate pass for
this.  Consider a loop calculating two or four CRCs in parallel, 
replacing LC PHIs one-by-one should be able to handle this.
I suspect that'll be quite hard for both the "does this generally look 
like a CRC loop" code as well as the "validate this is a CRC loop" code.


Mariam, your thoughts on whether or not those two phases could handle a 
loop with two CRC calculations inside, essentially creating two calls to 
our new builtins?


Jeff




[to-be-committed] [RISC-V] Use pack to handle repeating constants

2024-05-28 Thread Jeff Law
This patch utilizes zbkb to improve the code we generate for 64bit 
constants when the high half is a duplicate of the low half.


Basically we generate the low half and use a pack instruction with that 
same register repeated.  ie


pack dest,src,src

That gives us a maximum sequence of 3 instructions and sometimes it will 
be just 2 instructions (say if the low 32bits can be constructed with a 
single addi or lui).


As with shadd, I'm abusing an RTL opcode.  This time it's CONCAT.  It's 
reasonably close to what we're doing.  Obviously it's just how we 
identify the desire to generate a pack in the array of opcodes.  We 
don't actually emit a CONCAT.


Note that we don't care about the potential sign extension from bit 31. 
pack will only look at bits 0..31 of each input (for rv64).  So we go 
ahead and sign extend before synthesizing the low part as that allows us 
to handle more cases trivially.
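The detection step amounts to the following check (a plain-C sketch; `repeating_halves` is a name made up here -- the patch does this inline in riscv_build_integer):

```c
#include <stdint.h>

/* Return nonzero if VALUE's high 32 bits duplicate its low 32 bits,
   i.e. the constant can be built by synthesizing one half and then
   packing it with itself.  */
static int
repeating_halves (uint64_t value, uint32_t *half)
{
  uint32_t lo = (uint32_t) value;
  if ((value >> 32) == lo)
    {
      *half = lo;
      return 1;
    }
  return 0;
}
```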


I had my testsuite generator chew on random cases of a repeating 
constant without any surprises.  I don't see much point in including all 
those in the testcase (after all there's 2**32 of them).  I've got a set 
of 10 I'm including.  Nothing particularly interesting in them.


An enterprising developer that needs this improved without zbkb could 
probably do so with a bit of work.  First increase the cost by 1 unit. 
Second avoid cases where bit 31 is set and restrict it to cases when we 
can still create pseudos.   On the codegen side, when encountering the 
CONCAT, generate the appropriate shift of "X" into a temporary register, 
then IOR the temporary with "X" into the new destination.


Anyway, I've tested this in my tester (though it doesn't turn on zbkb, 
yet).  I'll let the CI system chew on it overnight, but like mine, I 
don't think it lights up zbkb.  So it's unlikely to spit out anything 
interesting.


Jeffdiff --git a/gcc/config/riscv/crypto.md b/gcc/config/riscv/crypto.md
index b632312ade2..b9cac78fce1 100644
--- a/gcc/config/riscv/crypto.md
+++ b/gcc/config/riscv/crypto.md
@@ -107,7 +107,7 @@ (define_insn "riscv_pack_"
 ;; This is slightly more complex than the other pack patterns
 ;; that fully expose the RTL as it needs to self-adjust to
 ;; rv32 and rv64.  But it's not that hard.
-(define_insn "*riscv_xpack__2"
+(define_insn "riscv_xpack___2"
   [(set (match_operand:X 0 "register_operand" "=r")
(ior:X (ashift:X (match_operand:X 1 "register_operand" "r")
 (match_operand 2 "immediate_operand" "n"))
diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index a99211d56b1..91fefacee80 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -1123,6 +1123,22 @@ riscv_build_integer (struct riscv_integer_op *codes, 
HOST_WIDE_INT value,
}
 }
 
+  /* With pack we can generate a 64 bit constant with the same high
+     and low 32 bits trivially.  */
+  if (cost > 3 && TARGET_64BIT && TARGET_ZBKB)
+{
+  unsigned HOST_WIDE_INT loval = value & 0xffffffff;
+  unsigned HOST_WIDE_INT hival = value & ~loval;
+  if (hival >> 32 == loval)
+   {
+ cost = 1 + riscv_build_integer_1 (codes, sext_hwi (loval, 32), mode);
+ codes[cost - 1].code = CONCAT;
+ codes[cost - 1].value = 0;
+ codes[cost - 1].use_uw = false;
+   }
+
+}
+
   return cost;
 }
 
@@ -2679,6 +2695,13 @@ riscv_move_integer (rtx temp, rtx dest, HOST_WIDE_INT 
value,
  rtx t = can_create_pseudo_p () ? gen_reg_rtx (mode) : temp;
  x = riscv_emit_set (t, x);
}
+ else if (codes[i].code == CONCAT)
+   {
+ rtx t = can_create_pseudo_p () ? gen_reg_rtx (mode) : temp;
+ rtx t2 = gen_lowpart (SImode, x);
+ emit_insn (gen_riscv_xpack_di_si_2 (t, x, GEN_INT (32), t2));
+ x = t;
+   }
  else
x = gen_rtx_fmt_ee (codes[i].code, mode,
x, GEN_INT (codes[i].value));
diff --git a/gcc/testsuite/gcc.target/riscv/synthesis-9.c 
b/gcc/testsuite/gcc.target/riscv/synthesis-9.c
new file mode 100644
index 000..cc622188abc
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/synthesis-9.c
@@ -0,0 +1,28 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target rv64 } */
+/* We aggressively skip as we really just need to test the basic synthesis
+   which shouldn't vary based on the optimization level.  -O1 seems to work
+   and eliminates the usual sources of extraneous dead code that would throw
+   off the counts.  */
/* { dg-skip-if "" { *-*-* } { "-O0" "-Og" "-O2" "-O3" "-Os" "-Oz" "-flto" } } */
+/* { dg-options "-march=rv64gc_zba_zbb_zbkb_zbs" } */
+
+/* Rather than test for a specific synthesis of all these constants or
+   having thousands of tests each testing one variant, we just test the
+   total number of instructions.
+
+   This isn't expected to change much and any change is worthy of a look.  */
+/* { dg-final { scan-assembler-times 
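
The pack-based synthesis in the patch above relies on the Zbkb pack semantics: the low halves of the two source registers are concatenated. A minimal C model of that (emulate_pack is an illustrative name, not a GCC function):

```c
#include <assert.h>

/* Model of RV64 Zbkb "pack rd, rs1, rs2": the low 32 bits of rd come
   from rs1, the high 32 bits from the low 32 bits of rs2.  A constant
   whose high and low halves are equal can therefore be built by
   synthesizing the low half once and issuing "pack rd, rs, rs".  */
unsigned long
emulate_pack (unsigned long rs1, unsigned long rs2)
{
  return (rs1 & 0xffffffffUL) | (rs2 << 32);
}
```

With equal operands this reproduces the hival >> 32 == loval case the new riscv_build_integer code checks for.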

Re: [RFC/RFA] [PATCH 08/12] Add a new pass for naive CRC loops detection

2024-05-27 Thread Jeff Law




On 5/24/24 2:42 AM, Mariam Arutunian wrote:
This patch adds a new compiler pass aimed at identifying naive CRC 
implementations,
characterized by the presence of a loop calculating a CRC (polynomial 
long division).

Upon detection of a potential CRC, the pass prints an informational message.

Performs CRC optimization if the optimization level is >= 2,
except when optimizing for size or when -fno-gimple-crc-optimization is given.

This pass is added for the detection and optimization of naive CRC
implementations, improving the efficiency of CRC-related computations.

This patch includes only initial fast checks for filtering out non-CRCs;
the verification and optimization parts for detected CRC candidates will be
provided in subsequent patches.


   gcc/

     * Makefile.in (OBJS): Add gimple-crc-optimization.o.
     * common.opt (fgimple-crc-optimization): New option.
     * doc/invoke.texi (-fgimple-crc-optimization): Add documentation.
     * gimple-crc-optimization.cc: New file.
     * gimple.cc (set_phi_stmts_not_visited): New function.
     (set_gimple_stmts_not_visited): Likewise.
     (set_bbs_stmts_not_visited): Likewise.
     * gimple.h (set_gimple_stmts_not_visited): New extern function 
declaration.

     (set_phi_stmts_not_visited): New extern function declaration.
     (set_bbs_stmts_not_visited): New extern function declaration.
     * opts.cc (default_options_table): Add OPT_fgimple_crc_optimization.
     (enable_fdo_optimizations): Enable gimple-crc-optimization.
     * passes.def (pass_crc_optimization): Add new pass.
     * timevar.def (TV_GIMPLE_CRC_OPTIMIZATION): New timevar.
     * tree-pass.h (make_pass_crc_optimization): New extern function 
declaration.


Signed-off-by: Mariam Arutunian <mariamarutun...@gmail.com>



diff --git a/gcc/common.opt b/gcc/common.opt
index 2c078fdd1f8..53f7ab255dd 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1757,6 +1757,12 @@ Common Var(flag_gcse_after_reload) Optimization
 Perform global common subexpression elimination after register allocation has
 finished.
 
+fgimple-crc-optimization

+Common Var(flag_gimple_crc_optimization) Optimization
+Detect loops calculating CRC and replace with faster implementation.
+If the target supports carry-less-multiplication instruction, generate CRC 
using
+it; otherwise generate table-based CRC.
This probably needs a minor update since your code can also generate a 
CRC instruction on x86 when it detects a CRC loop with the right polynomial.







+
+  /* Returns true if there are only two conditional blocks in the loop
+ (one may be for the CRC bit check and the other for the loop counter).
+ This may filter out some real CRCs, where more than one condition
+ is checked for the CRC calculation.  */
+  static bool loop_contains_two_conditional_bb (basic_block *loop_bbs,
+   unsigned loop_num_nodes);
It's been a while, so if we're rehashing something we already worked 
through, I apologize.


IIRC we looked at the problem of canonicalizing the loop into a form 
where we didn't necessarily have conditional blocks, instead we had 
branchless sequences for the conditional xor and dealing with the high 
bit in the crc.  My recollection was that the coremark CRC loop would 
always canonicalize, but that in general we still saw multiple CRC 
implementations that did not canonicalize and thus we still needed the 
more complex matching.  Correct?
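
For concreteness, the two loop shapes being discussed look roughly like this (a sketch using the reflected CRC-16/CCITT polynomial 0x8408 as an example; the function and variable names are illustrative, not from the patch):

```c
#include <assert.h>

/* Conditional-xor form: one conditional block per bit.  */
static unsigned short
crc_branchy (unsigned short crc, unsigned char data)
{
  crc ^= data;
  for (int i = 0; i < 8; i++)
    {
      if (crc & 1)
        crc = (crc >> 1) ^ 0x8408;
      else
        crc >>= 1;
    }
  return crc;
}

/* Branchless canonical form: the conditional xor becomes a mask.  */
static unsigned short
crc_branchless (unsigned short crc, unsigned char data)
{
  crc ^= data;
  for (int i = 0; i < 8; i++)
    {
      unsigned short mask = -(crc & 1);   /* all-ones iff low bit set */
      crc = (crc >> 1) ^ (0x8408 & mask);
    }
  return crc;
}
```

The two compute the same values; the question is whether the optimizer can be relied on to canonicalize the first into the second before the matcher runs.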




+
+  /* Checks whether found XOR_STMT is for calculating CRC.
+ The function CRC_FUN calculates CRC only if there is a shift operation
+ in the crc loop.  */
+  bool xor_calculates_crc (function *crc_fun, basic_block *loop_bbs,
+  const gimple *xor_stmt);
So the second sentence in the comment doesn't really seem to relate to 
this function.  It also seems like we potentially have two xors in a CRC 
loop.  One which xors one bit of the data with one bit of the crc.  The 
other is the conditional xor.  Which does this refer to?  I'm guessing 
the latter since the former can likely hoist out of the loop given a 
sufficiently smart loop optimizer.





+
+  /* Checks that the variable used in the condition COND is the assumed CRC
+ (or depends on the assumed CRC).
+ Also sets data member m_phi_for_data if it isn't set and exists.  */
+  bool is_crc_checked (gcond *cond, basic_block *loop_bbs);
+
+  /* Returns true if condition COND checks MSB/LSB bit is 1.
+ Otherwise, return false.  */
+  static bool cond_true_is_checked_for_bit_one (const gcond *cond);
+
+  /* Returns opposite block of the XOR_BB from PRED_BB's dest blocks.  */
+  static basic_block get_xor_bb_opposite (basic_block pred_bb,
+ basic_block xor_bb);
+
+  /* Checks whether the pair of xor's shift exists in the opposite
+ basic block (OPPOSITE_BB).
+ If there is a shift and xor in the same block,
+ then in the opposite block must be another shift.  */
+  bool 

Re: [PATCH 4/4] resource.cc: Remove redundant conditionals

2024-05-27 Thread Jeff Law




On 5/27/24 11:54 AM, Hans-Peter Nilsson wrote:

Regtested cris-elf.  Ok to commit?

-- >8 --
No functional change.

- We always have a target_hash_table and bb_ticks because
init_resource_info is always called.  These conditionals are
an ancient artifact: it's been quite a while since
resource.cc was used elsewhere than exclusively from reorg.cc

- In mark_target_live_regs, get rid of a now-redundant "if
(tinfo != NULL)" conditional and replace an "if (bb)" with a
gcc_assert.

A "git diff -wb" (ignore whitespace diff) is better at
showing the actual changes.

* resource.cc (free_resource_info, clear_hashed_info_for_insn): Don't
check for non-null target_hash_table and bb_ticks.
(mark_target_live_regs): Ditto.  Replace check for non-NULL result from
BLOCK_FOR_INSN with a call to gcc_assert.  Fold code conditioned on
tinfo != NULL.

OK once prereqs are fully ack'd.

jeff



Re: [PATCH 3/4] resource.cc (mark_target_live_regs): Remove check for bb not found

2024-05-27 Thread Jeff Law




On 5/27/24 11:53 AM, Hans-Peter Nilsson wrote:

Regtested cris-elf.  Ok to commit?

-- >8 --
No functional change.

A "git diff -wb" (ignore whitespace diff) shows that this
commit just removes a "if (b != -1)" after a "gcc_assert (b
!= -1)" and also removes the subsequent "else" clause.

* resource.cc (mark_target_live_regs): Remove redundant check for b
being -1, after gcc_assert.

OK once prereqs are approved.

jeff



Re: [PATCH 2/4] resource.cc: Replace calls to find_basic_block with cfgrtl BLOCK_FOR_INSN

2024-05-27 Thread Jeff Law




On 5/27/24 11:52 AM, Hans-Peter Nilsson wrote:

Regtested cris-elf.  Ok to commit?

-- >8 --
...and call compute_bb_for_insn in init_resource_info and
free_bb_for_insn in free_resource_info.

I put a gcc_unreachable in that else-clause for a failing
find_basic_block in mark_target_live_regs after the comment that says:

 /* We didn't find the start of a basic block.  Assume everything
in use.  This should happen only extremely rarely.  */
 SET_HARD_REG_SET (res->regs);

and found that it fails not extremely rarely but extremely early in
the build (compiling libgcc).

That kind of pessimization leads to suboptimal delay-slot-filling.
Instead, do like many machine_dependent_reorg passes and call
compute_bb_for_insn as part of resource.cc initialization.

After this patch, there's a whole "if (b != -1)" conditional that's
dominated by a gcc_assert (b != -1).  I separated that, as it's a NFC
whitespace patch that hampers patch readability.

Altogether this improved coremark performance for CRIS at -O2
-march=v10 by 0.36%.

* resource.cc: Include cfgrtl.h.  Use BLOCK_FOR_INSN (insn)->index
instead of calling find_basic_block (insn).  Assert for not -1.
(find_basic_block): Remove function.
(init_resource_info): Call compute_bb_for_insn.
(free_resource_info): Call free_bb_for_insn.
I'm pretty sure that code was part of the overall problem -- namely that 
we didn't have good basic block info, so we resorted to insn scanning.


Presumably we set BLOCK_FOR_INSN when we generate a wrapper SEQUENCE 
insns for a filled delay slot?  Assuming we do create the right mapping 
for those new insns, then this is OK.


jeff





Re: [PATCH 1/4] resource.cc (mark_target_live_regs): Don't look past target insn, PR115182

2024-05-27 Thread Jeff Law




On 5/27/24 11:52 AM, Hans-Peter Nilsson wrote:



The problem is in mark_target_live_regs: it consults a hash-table
indexed by insn uid, where it tracks the currently live registers with
a "generation" count to handle when it moves around insn, filling
delay-slots.  As a fall-back, it starts with registers live at the
start of each basic block, calculated by the comparatively modern df
machinery (except that it can fail finding out which basic block an
insn belongs to, at which times it includes all registers film at 11),
and tracks the semantics of insns up to each insn.
My recollection of the resource tracking code isn't great; much of the 
problem with this code is it pre-dates the existence of a CFG and was 
relying on life information built by flow.c for use by combine.c.  Note 
there are numerous passes between where the old flow.c sat and reorg.c 
sat.   So we had to cope with various transformations that happened 
between those two points.




You'd think that's all that should be done, but then for some reason
it *also* looks at insns *after the target insn* up to a few branches,
and includes that in the set of live registers!  This is the code in
mark_target_live_regs that starts with the call to
find_dead_or_set_registers.  I couldn't make sense of it, so I looked
at its history, and I think I found the cause; it's a thinko or
possibly two thinkos.  The original implementation, gcc-git-described
as r0-97-g9c7e297806a27f, later moved from reorg.c to resource.c in
r0-20470-gca545bb569b756.

I have no recollection of doing that review for Diego...

Conceptually we should be able to look at the livein set of the target 
block and be done.  Anything used before it was set at the target must 
be in the livein set.  Something set at the target before being used 
shouldn't be in the livein set.
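
A tiny illustration of that live-in reasoning (hypothetical code, purely for exposition):

```c
#include <assert.h>

/* If the filler's target were the head of this function's body:
   'a' is used before any set, so it belongs in the live-in set;
   'b' is set before it is used, so it does not.  */
int
target_block (int a, int b)
{
  int t = a + 1;   /* use of 'a' -> 'a' is live-in */
  b = 5;           /* set of 'b' before any use -> 'b' is not live-in */
  return t + b;
}
```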





I believe the "extra" lookup was intended to counter flaws in the
reorg.c/resource.c register liveness analysis; to inspect insns along
the execution paths to exclude registers that, when looking at
subsequent insns, weren't live.  That guess is backed by a sentence in
the updated (i.e. deleted) part of the function head comment for
mark_target_live_regs: "Next, scan forward from TARGET looking for
things set or clobbered before they are used.  These are not live."
To me that sounds like flawed register-liveness data.

More likely than not you're right.




I think, with a reliable "DF_LR_IN", the whole thing *after* tracking
from start-of-bb up to the target insn should be removed; thus.

This patch also removes the now-unused find_dead_or_set_registers
function.

At r15-518, it fixes the issue for CRIS and improves coremark scores
at -O2 -march=v10 a tiny bit (about 0.05%).

PR rtl-optimization/115182
* resource.cc (mark_target_live_regs): Don't look for
unconditional branches after the target to improve on the
register liveness.
(find_dead_or_set_registers): Remove unused function.

OK.

Jeff



Re: [PATCH 0/4] Some improvements to resource.cc, including fixing PR115182

2024-05-27 Thread Jeff Law




On 5/27/24 11:51 AM, Hans-Peter Nilsson wrote:

The code in resource.cc is exclusively used by the delay-slot-filling
machinery: the "dbr" pass in reorg.cc, sometimes referred to just as
"reorg".  Its implementation is quite arcane, scanning RTL, with only
a little dash of cfgrtl.  I'm sure some think that rather than fixing
this machinery, it better die with architectures having delay-slots.
Those people will hopefully still enjoy these patches, as the bulk is
code removal.
Well, I always wanted it changed to use the scheduler's dependency 
analysis code to give us the candidate set of insns for delay slot 
filling rather than doing a bespoke life analysis.  But we never got 
that implemented and delay slot architectures have diminished in their 
importance over the last 25 years :-)


Jeff


[to-be-committed] [RISC-V] Some basic patterns for zbkb code generation

2024-05-27 Thread Jeff Law
And here's Lyut's basic Zbkb support.  Essentially it's four new 
patterns for packh, packw, pack plus a bridge pattern needed for packh.


packw is a bit ugly as we need to match a sign extension in an 
inconvenient location.  We pull it out so that the extension is exposed 
in a convenient place for subsequent sign extension elimination.


We need a bridge pattern to get packh.  Thankfully the bridge pattern is 
a degenerate packh where one operand is x0, so it works as-is without 
splitting and provides the bridge to the more general form of packh.
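
A C model of packh's semantics makes the bridge relationship visible (emulate_packh is an illustrative name; the semantics below follow the Zbkb specification):

```c
#include <assert.h>

/* Zbkb "packh rd, rs1, rs2": pack the least-significant bytes of rs1
   and rs2 into the low 16 bits of rd, zero-extending the result.
   With rs1 == x0 (always zero) this collapses to exactly the
   (and (ashift x 8) 0xff00) form the bridge pattern matches.  */
unsigned long
emulate_packh (unsigned long rs1, unsigned long rs2)
{
  return ((rs2 & 0xff) << 8) | (rs1 & 0xff);
}
```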


This patch also refines the condition for the constant reassociation 
patch to avoid a few more cases than can be handled efficiently with 
other preexisting patterns and one bugfix to avoid losing bits, 
particularly in the xor/ior case.


Lyut did the core work here.  I think I did some minor cleanups and the 
bridge pattern to work with gcc-15 and beyond.


This is a prerequisite for using zbkb in constant synthesis.  It also 
stands on its own.  I know we've seen it trigger in spec without the 
constant synthesis bits.


It's been through our internal CI and my tester.  I'll obviously wait 
for the upstream CI to finish before taking further action.


Jeff

diff --git a/gcc/config/riscv/crypto.md b/gcc/config/riscv/crypto.md
index dd2bc94ee88..b632312ade2 100644
--- a/gcc/config/riscv/crypto.md
+++ b/gcc/config/riscv/crypto.md
@@ -104,6 +104,19 @@ (define_insn "riscv_pack_"
   "pack\t%0,%1,%2"
   [(set_attr "type" "crypto")])
 
+;; This is slightly more complex than the other pack patterns
+;; that fully expose the RTL as it needs to self-adjust to
+;; rv32 and rv64.  But it's not that hard.
+(define_insn "*riscv_xpack__2"
+  [(set (match_operand:X 0 "register_operand" "=r")
+   (ior:X (ashift:X (match_operand:X 1 "register_operand" "r")
+(match_operand 2 "immediate_operand" "n"))
+  (zero_extend:X
+    (match_operand:HX 3 "register_operand" "r"))))]
+  "TARGET_ZBKB && INTVAL (operands[2]) == BITS_PER_WORD / 2"
+  "pack\t%0,%3,%1"
+  [(set_attr "type" "crypto")])
+
 (define_insn "riscv_packh_"
   [(set (match_operand:X 0 "register_operand" "=r")
 (unspec:X [(match_operand:QI 1 "register_operand" "r")
@@ -113,6 +126,29 @@ (define_insn "riscv_packh_"
   "packh\t%0,%1,%2"
   [(set_attr "type" "crypto")])
 
+;; So this is both a useful pattern unto itself and a bridge to the
+;; general packh pattern below.
+(define_insn "*riscv_packh__2"
+  [(set (match_operand:X 0 "register_operand" "=r")
+   (and:X (ashift:X (match_operand:X 1 "register_operand" "r")
+(const_int 8))
+  (const_int 65280)))]
+ "TARGET_ZBKB"
+ "packh\t%0,x0,%1"
+ [(set_attr "type" "crypto")])
+
+;; While the two operands of the IOR could be swapped, this appears
+;; to be the canonical form.  The other form doesn't seem to trigger.
+(define_insn "*riscv_packh__3"
+  [(set (match_operand:X 0 "register_operand" "=r")
+   (ior:X (and:X (ashift:X (match_operand:X 1 "register_operand" "r")
+   (const_int 8))
+ (const_int 65280))
+  (zero_extend:X (match_operand:QI 2 "register_operand" "r"))))]
+ "TARGET_ZBKB"
+ "packh\t%0,%2,%1"
+ [(set_attr "type" "crypto")])
+
 (define_insn "riscv_packw"
   [(set (match_operand:DI 0 "register_operand" "=r")
 (unspec:DI [(match_operand:HI 1 "register_operand" "r")
@@ -122,6 +158,33 @@ (define_insn "riscv_packw"
   "packw\t%0,%1,%2"
   [(set_attr "type" "crypto")])
 
+;; Implemented as a splitter for initial recognition.  It generates
+;; new RTL with the extension moved to the outer position.  This
+;; allows later code to eliminate subsequent explicit sign extensions.
+(define_split
+  [(set (match_operand:DI 0 "register_operand")
+   (ior:DI (ashift:DI
+ (sign_extend:DI (match_operand:HI 1 "register_operand"))
+ (const_int 16))
+   (zero_extend:DI (match_operand:HI 2 "register_operand"))))]
+  "TARGET_ZBKB && TARGET_64BIT"
+  [(set (match_dup 0)
+   (sign_extend:DI (ior:SI (ashift:SI (match_dup 1) (const_int 16))
+   (zero_extend:SI (match_dup 2)))))]
+  "operands[1] = gen_lowpart (SImode, operands[1]);")
+
+;; And this patches the result of the splitter above.
+(define_insn "*riscv_packw_2"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+   (sign_extend:DI
+ (ior:SI
+   (ashift:SI (match_operand:SI 1 "register_operand" "r")
+  (const_int 16))
+   (zero_extend:SI (match_operand:HI 2 "register_operand" "r")))))]
+  "TARGET_ZBKB && TARGET_64BIT"
+  "packw\t%0,%2,%1"
+  [(set_attr "type" "crypto")])
+
 ;; ZBKX extension
 
 (define_insn "riscv_xperm4_"
diff --git a/gcc/config/riscv/iterators.md b/gcc/config/riscv/iterators.md
index 8a9d1986b4a..3c139bc2e30 100644
--- a/gcc/config/riscv/iterators.md
+++ b/gcc/config/riscv/iterators.md
@@ -37,6 +37,9 @@ 

Re: [RFC/RFA] [PATCH 12/12] Add tests for CRC detection and generation.

2024-05-27 Thread Jeff Law




On 5/27/24 12:39 AM, Richard Biener wrote:

On Sat, May 25, 2024 at 8:34 PM Jeff Law  wrote:




On 5/24/24 2:42 AM, Mariam Arutunian wrote:

gcc/testsuite/gcc.c-torture/compile/

  * crc-11.c: New test.
  * crc-15.c: Likewise.
  * crc-16.c: Likewise.
  * crc-19.c: Likewise.
  * crc-2.c: Likewise.
  * crc-20.c: Likewise.
  * crc-24.c: Likewise.
  * crc-29.c: Likewise.
  * crc-27.c: Likewise.
  * crc-3.c: Likewise.
  * crc-crc32-data24.c: Likewise.
  * crc-from-fedora-packages (1-24).c: Likewise.
  * crc-linux-(1-5).c: Likewise.
  * crc-not-crc-(1-26).c: Likewise.
  * crc-side-instr-(1-17).c: Likewise.

gcc/testsuite/gcc.c-torture/execute/

  * crc-(1, 4-10, 12-14, 17-18, 21-28).c: New tests.
  * crc-CCIT-data16-xorOutside_InsideFor.c: Likewise.
  * crc-CCIT-data16.c: Likewise.
  * crc-CCIT-data8.c: Likewise.
  * crc-coremark16-data16.c: Likewise.
  * crc-coremark16-data16.c: Likewise.
  * crc-coremark32-data32.c: Likewise.
  * crc-coremark32-data8.c: Likewise.
  * crc-coremark64-data64.c: Likewise.
  * crc-coremark8-data8.c: Likewise.
  * crc-crc32-data16.c: Likewise.
  * crc-crc32-data8.c: Likewise.
  * crc-crc32.c: Likewise.
  * crc-crc64-data32.c: Likewise.
  * crc-crc64-data64.c: Likewise.
  * crc-crc8-data8-loop-xorInFor.c: Likewise.
  * crc-crc8-data8-loop-xorOutsideFor.c: Likewise.
  * crc-crc8-data8-xorOustideFor.c: Likewise.
  * crc-crc8.c: Likewise.

Signed-off-by: Mariam Arutunian mailto:mariamarutun...@gmail.com>>

OK once all prerequisites are approved.


At some point gcc.c-torture/ was considered "legacy", is that no longer
true?  I'd prefer if you add these tests to gcc.dg/torture/ instead.
I don't recall that, but sure, it should be easy to move them over to 
gcc.dg/torture.  Noted for the future as well.


Jeff



Re: [RFC/RFA] [PATCH 02/12] Add built-ins and tests for bit-forward and bit-reversed CRCs

2024-05-27 Thread Jeff Law




On 5/27/24 12:38 AM, Richard Biener wrote:

On Fri, May 24, 2024 at 10:44 AM Mariam Arutunian
 wrote:


This patch introduces new built-in functions to GCC for computing bit-forward 
and bit-reversed CRCs.
These builtins aim to provide efficient CRC calculation capabilities.
When the target architecture supports CRC operations (as indicated by the 
presence of a CRC optab),
the builtins will utilize the expander to generate CRC code.
In the absence of hardware support, the builtins default to generating code for 
a table-based CRC calculation.


I wonder whether for embedded target use we should arrange for the 
table-based CRC calculation to be out-of-line and implemented in a

way so uses across TUs can be merged?  I guess a generic
implementation inside libgcc is difficult?
I think the difficulty is the table is dependent upon the polynomial. 
So we'd have to arrange to generate, then pass in the table.


In theory we could have the linker fold away duplicate tables as those 
should be in read only sections without relocations to internal members. 
 So much like we do for constant pool entries.  Though this hasn't been 
tested.


The CRC implementation itself could be subject to ICF if it's off in its 
own function.  If it's inlined (and that's a real possibility), then 
there's little hope of ICF helping on the codesize.


Or we could just not do any of this for -Os/-Oz if the target doesn't 
have a carryless multiply or crc with the appropriate polynomial.  Given 
the CRC table is probably larger than all the code in a bitwise 
implementation, disabling for -Os/-Oz seems like a very reasonable choice.
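
A sketch of why the table is tied to the polynomial: it is just the bitwise CRC of every byte value, precomputed for one specific polynomial (function names here are illustrative, not a proposed libgcc interface):

```c
#include <assert.h>

/* Build the 256-entry table for a given reflected polynomial.  */
static void
crc32_build_table (unsigned int poly, unsigned int table[256])
{
  for (unsigned int i = 0; i < 256; i++)
    {
      unsigned int c = i;
      for (int k = 0; k < 8; k++)
        c = (c & 1) ? (c >> 1) ^ poly : c >> 1;
      table[i] = c;
    }
}

/* One byte of table-driven CRC.  */
static unsigned int
crc32_update (const unsigned int table[256], unsigned int crc,
              unsigned char byte)
{
  return (crc >> 8) ^ table[(crc ^ byte) & 0xff];
}
```

A generic libgcc entry point would have to take the table (or the polynomial) as an argument, which is the difficulty noted above.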



jeff


[to-be-committed][RISC-V] Reassociate constants in logical ops

2024-05-26 Thread Jeff Law
This patch from Lyut will reassociate operands when we have shifted 
logical operations.  This can simplify a constant that may not fit in 
a simm12 into a form that does fit in a simm12.
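
In C terms, the rewrite the pattern performs looks like this (foo_orig/foo_reassoc are illustrative, using the constant from the new tests):

```c
#include <assert.h>

/* (a << 8) & 24320 needs 24320 (0x5f00) in a register, but shifting
   the mask down first gives 95 (0x5f), which fits in a simm12:
   (a << C2) & C3  ==>  (a & (C3 >> C2)) << C2, when no bits are lost.  */
static int foo_orig    (int a) { return (a << 8) & 24320; }
static int foo_reassoc (int a) { return (a & (24320 >> 8)) << 8; }
```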


The basic work was done by Lyut.  I generalized it to handle XOR/OR.

It stands on its own, but also helps the upcoming Zbkb work from Lyut.

This has survived Ventana's CI system as well as my tester.  Obviously 
I'll wait for a verdict from the Rivos CI system before moving forward.


Jeff

gcc/

* config/riscv/riscv.md (_shift_reverse): New pattern.

gcc/testsuite

* gcc.target/riscv/and-shift32.c: New test.
* gcc.target/riscv/and-shift64.c: New test.


Co-authored-by: Jeffrey A Law 

diff --git a/gcc/config/riscv/riscv.md b/gcc/config/riscv/riscv.md
index 756095297e4..939c2e3014a 100644
--- a/gcc/config/riscv/riscv.md
+++ b/gcc/config/riscv/riscv.md
@@ -2937,6 +2937,33 @@ (define_insn_and_split "*si3_extend_mask"
   [(set_attr "type" "shift")
(set_attr "mode" "SI")])
 
+;; We can reassociate the shift and bitwise operator which may allow us to
+;; reduce the immediate operand of the bitwise operator into a range that
+;; fits in a simm12.
+;;
+;; We need to make sure that shifting does not lose any bits, particularly
+;; for IOR/XOR.  It probably doesn't matter for AND.
+;;
+;; We also don't want to do this if the immediate already fits in a simm12
+;; field.
+(define_insn_and_split "_shift_reverse"
+  [(set (match_operand:X 0 "register_operand" "=r")
+(any_bitwise:X (ashift:X (match_operand:X 1 "register_operand" "r")
+(match_operand 2 "immediate_operand" "n"))
+  (match_operand 3 "immediate_operand" "n")))]
+  "(!SMALL_OPERAND (INTVAL (operands[3]))
+   && SMALL_OPERAND (INTVAL (operands[3]) >> INTVAL (operands[2]))
+   && popcount_hwi (INTVAL (operands[3])) <= popcount_hwi (INTVAL (operands[3]) >> INTVAL (operands[2])))"
+  "#"
+  "&& 1"
+  [(set (match_dup 0) (any_bitwise:X (match_dup 1) (match_dup 3)))
+   (set (match_dup 0) (ashift:X (match_dup 0) (match_dup 2)))]
+  {
+operands[3] = GEN_INT (INTVAL (operands[3]) >> INTVAL (operands[2]));
+  }
+  [(set_attr "type" "shift")
+   (set_attr "mode" "")])
+
 ;; Non-canonical, but can be formed by ree when combine is not successful at
 ;; producing one of the two canonical patterns below.
 (define_insn "*lshrsi3_zero_extend_1"
diff --git a/gcc/testsuite/gcc.target/riscv/and-shift32.c b/gcc/testsuite/gcc.target/riscv/and-shift32.c
new file mode 100644
index 000..38ee63e8d79
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/and-shift32.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv32gc -mabi=ilp32" } */
+/* { dg-skip-if "" { *-*-* } { "-O0" "-g" } } */
+
+int foo(int a)
+{
+  return (a << 8) & 24320;
+}
+
+/* { dg-final { scan-assembler-times "\\sandi\\s" 1 } } */
+/* { dg-final { scan-assembler-times "\\sslli\\s" 1 } } */
+/* { dg-final { scan-assembler-not "\\sli\\s" } } */
+/* { dg-final { scan-assembler-not "\\saddi\\s" } } */
\ No newline at end of file
diff --git a/gcc/testsuite/gcc.target/riscv/and-shift64.c b/gcc/testsuite/gcc.target/riscv/and-shift64.c
new file mode 100644
index 000..ccfaedd508a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/and-shift64.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gc -mabi=lp64" } */
+/* { dg-skip-if "" { *-*-* } { "-O0" "-g" } } */
+
+long long foo(long long a)
+{
+  return (a << 8) & 24320;
+}
+
+/* { dg-final { scan-assembler-times "\\sandi\\s" 1 } } */
+/* { dg-final { scan-assembler-times "\\sslli\\s" 1 } } */
+/* { dg-final { scan-assembler-not "\\sli\\s" } } */
+/* { dg-final { scan-assembler-not "\\saddi\\s" } } */
\ No newline at end of file


[to-be-committed] [RISC-V] Try inverting for constant synthesis

2024-05-26 Thread Jeff Law
So there's another class of constants we're failing to synthesize well. 
Specifically those where we can invert our original constant C into C' 
and C' takes at least 2 fewer instructions to synthesize than C.  In 
that case we can initially generate C', then use xori with the constant 
-1 to flip all the bits resulting in our target constant.
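
The li+srli+xori shape can be modeled directly in C (synth_li_srli_xori and the constants below are illustrative, not taken from the patch):

```c
#include <assert.h>

/* li loads a sign-extended 12-bit immediate, srli shifts logically,
   and xori with -1 flips every bit, recovering C from ~C.  */
static long
synth_li_srli_xori (long imm12, int shamt)
{
  unsigned long t = (unsigned long) imm12;   /* li t0, imm12       */
  t >>= shamt;                               /* srli t0, t0, shamt */
  return (long) (t ^ ~0UL);                  /* xori t0, t0, -1    */
}
```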


I've only seen this trigger when the final synthesis is li+srli+xori. 
The original synthesis took on various 4 or 5 instruction forms.


Most of the methods we use to improve constant synthesis are in 
riscv_build_integer_1.  I originally tried to put this code in there. 
But that'll end up with infinite recursion due to some other ADDI 
related code which wants to flip bits and try synthesis.


So this was put into riscv_build_integer and recurses into 
riscv_build_integer.  This isn't unprecedented, just a bit different 
than most of the other synthesis implementation bits.


This doesn't depend on any extensions.  So it should help any rv64 system.

Tested on my tester, obviously I'll wait for a verdict from the CI 
system before moving forward.


Jeff

gcc/

* config/riscv/riscv.cc (riscv_build_integer_1): Verify there
are no bits left to set in the constant when generating bseti.
(riscv_build_integer): Synthesize ~value and if it's cheap use it
with a trailing xori with -1.

gcc/testsuite

* gcc.target/riscv/synthesis-8.c: New test.

diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index 560a454d7ab..1cca288cb57 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -1023,7 +1023,7 @@ riscv_build_integer_1 (struct riscv_integer_op codes[RISCV_MAX_INTEGER_OPS],
 
   /* If LUI+ADDI+BSETI resulted in a more efficient
 sequence, then use it.  */
-  if (i < cost)
+  if (value == 0 && i < cost)
{
  memcpy (codes, alt_codes, sizeof (alt_codes));
  cost = i;
@@ -1075,6 +1075,31 @@ riscv_build_integer (struct riscv_integer_op *codes, HOST_WIDE_INT value,
}
 }
 
+  /* See if we can generate the inverted constant, then use
+ not to get the desired constant.
+
+ This can't be in riscv_build_integer_1 as it'll mutually
+ recurse with another case in there.  And it has to recurse
+ into riscv_build_integer so we get the trailing 0s case
+ above.  */
+  if (cost > 2 && value < 0)
+{
+  struct riscv_integer_op alt_codes[RISCV_MAX_INTEGER_OPS];
+  int alt_cost;
+
+  HOST_WIDE_INT nval = ~value;
+  alt_cost = 1 + riscv_build_integer (alt_codes, nval, mode);
+  if (alt_cost < cost)
+   {
+ alt_codes[alt_cost - 1].code = XOR;
+ alt_codes[alt_cost - 1].value = -1;
+ alt_codes[alt_cost - 1].use_uw = false;
+ memcpy (codes, alt_codes, sizeof (alt_codes));
+ cost = alt_cost;
+   }
+}
+
+
   if (!TARGET_64BIT
   && (value > INT32_MAX || value < INT32_MIN))
 {
diff --git a/gcc/testsuite/gcc.target/riscv/synthesis-8.c b/gcc/testsuite/gcc.target/riscv/synthesis-8.c
new file mode 100644
index 000..2bcdb4e774d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/synthesis-8.c
@@ -0,0 +1,34 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target rv64 } */
+/* We aggressively skip as we really just need to test the basic synthesis
+   which shouldn't vary based on the optimization level.  -O1 seems to work
+   and eliminates the usual sources of extraneous dead code that would throw
+   off the counts.  */
/* { dg-skip-if "" { *-*-* } { "-O0" "-Og" "-O2" "-O3" "-Os" "-Oz" "-flto" } } */
+/* { dg-options "-march=rv64gc_zba_zbb_zbs" } */
+
+/* Rather than test for a specific synthesis of all these constants or
+   having thousands of tests each testing one variant, we just test the
+   total number of instructions.
+
+   This isn't expected to change much and any change is worthy of a look.  */
/* { dg-final { scan-assembler-times "\\t(add|addi|bseti|li|ret|sh1add|sh2add|sh3add|slli|srli|xori)" 72 } } */
+
+unsigned long foo_0xc00077ff(void) { return 0xc00077ffUL; }
+unsigned long foo_0xc000b7ff(void) { return 0xc000b7ffUL; }
+unsigned long foo_0xc00137ff(void) { return 0xc00137ffUL; }
+unsigned long foo_0xc00237ff(void) { return 0xc00237ffUL; }
+unsigned long foo_0xc00437ff(void) { return 0xc00437ffUL; }
+unsigned long foo_0xc00837ff(void) { return 0xc00837ffUL; }
+unsigned long foo_0xc01037ff(void) { return 0xc01037ffUL; }
+unsigned long foo_0xc02037ff(void) { return 0xc02037ffUL; }
+unsigned long foo_0xc04037ff(void) { return 0xc04037ffUL; }
+unsigned long foo_0xc08037ff(void) { return 0xc08037ffUL; }
+unsigned long foo_0xc10037ff(void) { return 0xc10037ffUL; }
+unsigned long foo_0xc20037ff(void) { return 0xc20037ffUL; }
+unsigned long 

Re: [PATCHv2 2/2] libiberty/buildargv: handle input consisting of only white space

2024-05-26 Thread Jeff Law




On 2/10/24 10:26 AM, Andrew Burgess wrote:

GDB makes use of the libiberty function buildargv for splitting the
inferior (program being debugged) argument string in the case where
the inferior is not being started under a shell.

I have recently been working to improve this area of GDB, and noticed
some unexpected behaviour to the libiberty function buildargv, when
the input is a string consisting only of white space.

What I observe is that if the input to buildargv is a string
containing only white space, then buildargv will return an argv list
containing a single empty argument, e.g.:

   char **argv = buildargv (" ");
   assert (*argv[0] == '\0');
   assert (argv[1] == NULL);

We get the same output from buildargv if the input is a single space,
or multiple spaces.  Other white space characters give the same
results.

This doesn't seem right to me, and in fact, there appears to be a work
around for this issue in expandargv where we have this code:

   /* If the file is empty or contains only whitespace, buildargv would
  return a single empty argument.  In this context we want no arguments,
  instead.  */
   if (only_whitespace (buffer))
 {
   file_argv = (char **) xmalloc (sizeof (char *));
   file_argv[0] = NULL;
 }
   else
 /* Parse the string.  */
 file_argv = buildargv (buffer);

I think that the correct behaviour in this situation is to return an
empty argv array, e.g.:

   char **argv = buildargv (" ");
   assert (argv[0] == NULL);

And it turns out that this is a trivial change to buildargv.  The diff
does look big, but this is because I've re-indented a block.  Check
with 'git diff -b' to see the minimal changes.  I've also removed the
work around from expandargv.

When testing this sort of thing I normally write the tests first, and
then fix the code.  In this case test-expandargv.c has sort-of been
used as a mechanism for testing the buildargv function (expandargv
does call buildargv most of the time), however, for this particular
issue the work around in expandargv (mentioned above) masked the
buildargv bug.

I did consider adding a new test-buildargv.c file, however, this would
have basically been a copy & paste of test-expandargv.c (with some
minor changes to call buildargv).  This would be fine now, but feels
like we would eventually end up with one file not being updated as
much as the other, and so test coverage would suffer.

Instead, I have added some explicit buildargv testing to the
test-expandargv.c file, this reuses the test input that is already
defined for expandargv.

Of course, once I removed the work around from expandargv then we now
do always call buildargv from expandargv, and so the bug I'm fixing
would impact both expandargv and buildargv, so maybe the new testing
is redundant?  I tend to think more testing is always better, so I've
left it in for now.
So just an FYI.  Sometimes folks include the -b diffs as well for these 
scenarios.  The problem with doing so (as I recently stumbled over 

myself) is the bots which monitor the list and apply patches get quite 
confused by that practice.  Anyway, just something to be aware of.


As for testing, I tend to agree: more is better unless we're highly 
confident it's redundant.  So I'll go with your judgment on the 
redundancy of the test.


As with the prior patch, you'll need to run it through the usual 
bootstrap/regression cycle and cobble together a ChangeLog.


OK once those things are taken care of.

jeff




Re: [PATCHv2 1/2] libiberty/buildargv: POSIX behaviour for backslash handling

2024-05-26 Thread Jeff Law




On 2/10/24 10:26 AM, Andrew Burgess wrote:

GDB makes use of the libiberty function buildargv for splitting the
inferior (program being debugged) argument string in the case where
the inferior is not being started under a shell.

I have recently been working to improve this area of GDB, and have
tracked down some of the unexpected behaviour to the libiberty
function buildargv, and how it handles backslash escapes.

For reference, I've been mostly reading:

   https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html

The issues that I would like to fix are:

   1. Backslashes within single quotes should not be treated as an
   escape, thus: '\a' should split to \a, retaining the backslash.

   2. Backslashes within double quotes should only act as an escape if
   they are immediately before one of the characters $ (dollar),
   ` (backtick), " (double quote), \ (backslash), or \n (newline).  In
   all other cases a backslash should not be treated as an escape
   character.  Thus: "\a" should split to \a, but "\$" should split to
   $.

   3. A backslash-newline sequence should be treated as a line
   continuation, both the backslash and the newline should be removed.

I've updated libiberty and also added some tests.  All the existing
libiberty tests continue to pass, but I'm not sure if there is more
testing that should be done; buildargv is used within lto-wrapper.cc,
so maybe there's some testing folk can suggest that I run?
I'm not immediately aware of other tests that we should be running other 
than the usual bootstrap & regression test.


You do need a ChangeLog entry or the pre-commit hooks are going to 
complain loudly.


OK for the trunk with the usual testing.  I know you're typically on the 
GDB side of things, so if you need help on the testing GCC, let us know.


jeff




Re: [PATCH] Support libcall __float{,un}sibf by SF when it is not supported for _bf16

2024-05-26 Thread Jeff Law




On 12/20/23 4:17 AM, Jin Ma wrote:

We don't have SI -> BF library functions, use SI -> SF -> BF
instead. Although this can also be implemented in a target
machine description, it is more appropriate to move
into target independent code.

gcc/ChangeLog:

* optabs.cc (expand_float): Split SI -> BF into SI -> SF -> BF.
---
  gcc/optabs.cc | 13 +
  1 file changed, 13 insertions(+)

diff --git a/gcc/optabs.cc b/gcc/optabs.cc
index 6a34276c239..c58a0321bbd 100644
--- a/gcc/optabs.cc
+++ b/gcc/optabs.cc
@@ -5727,6 +5727,19 @@ expand_float (rtx to, rtx from, int unsignedp)
if (is_narrower_int_mode (GET_MODE (from), SImode))
from = convert_to_mode (SImode, from, unsignedp);
  
+#ifdef HAVE_SFmode

+  if (REAL_MODE_FORMAT (GET_MODE (to)) == &arm_bfloat_half_format
+ && REAL_MODE_FORMAT (SFmode) == &ieee_single_format
+ && GET_MODE (from) == SImode)
+   /* We don't have SI -> BF library functions, use SI -> SF -> BF
+  instead.  */
+   {
+ target = gen_reg_rtx (SFmode);
+ expand_float (target, from, unsignedp);
+ goto done;
+   }
+#endif
Why do you have the #ifdef HAVE_SFmode?  That seems odd, I think the 
only place we do anything like that is in targhooks.  Why did you add 
those cpp conditionals?


Bring the comment "We don't have SI -> BF ..." inside the open curly and 
indent it two more spaces.  That should be more consistent with GCC style.


So generally OK.  I suspect this can move forward once we figure out why 
you added those cpp conditionals and fix the formatting nit.


jeff


Re: [PATCH] gimple-vr-values:Add constraint for gimple-cond optimization

2024-05-26 Thread Jeff Law




On 11/22/23 10:47 PM, Feng Wang wrote:

This patch adds another condition for gimple-cond optimization.  Refer to
the following test case.
int foo1 (int data, int res)
{
   res = data & 0xf;
   res |= res << 4;
   if (res < 0x22)
 return 0x22;
   return res;
}
with the compilation flag "-march=rv64gc_zba_zbb -mabi=lp64d -O2",
before this patch the compilation result is
foo1:
 andia0,a0,15
 slliw   a5,a0,4
 addwa3,a5,a0
 li  a4,33
 add a0,a5,a0
 bleua3,a4,.L5
 ret
.L5:
 li  a0,34
 ret
after this patch the compilation result is
foo1:
 andia0,a0,15
 slliw   a5,a0,4
 add a5,a5,a0
 li  a0,34
 max a0,a5,a0
 ret
The reason is that in the pass_early_vrp, the arg0 of gimple_cond
is replaced, but the PHI node still uses the arg0.
Some of the evrp pass logs are as follows:
  gimple_assign 
   gimple_assign 
   gimple_cond 
 goto ; [INV]
   else
 goto ; [INV]

:
   // predicted unlikely by early return (on trees) predictor.

:
   # gimple_phi <_2, 34(3), res_5(2)>
The arg0 of gimple_cond is replaced by _9, but the gimple_phi still
uses res_5, which causes the PHI-node-to-MAX_EXPR optimization to fail.
So next_use_is_phi is added to control the replacement.

gcc/ChangeLog:

 * vr-values.cc (next_use_is_phi):
 (simplify_using_ranges::simplify_casted_compare):
 add new function next_use_is_phi to control the replacement.

gcc/testsuite/ChangeLog:

 * gcc.target/riscv/zbb-min-max-04.c: New test.
I'm not sure what changed, but AFAICT this patch isn't needed anymore. 
We get the desired code with the trunk compiler.


Jeff



Re: [PATCH] libcpp: Correct typo 'r' -> '\r'

2024-05-26 Thread Jeff Law




On 5/25/24 11:16 AM, Peter Damianov wrote:

libcpp/ChangeLog:
* lex.cc (do_peek_prev): Correct typo in argument to __builtin_expect()

Thanks.  I've pushed this to the trunk.
jeff





Re: [PATCH v1] Gen-Match: Fix gen_kids_1 right hand braces mis-alignment

2024-05-26 Thread Jeff Law




On 5/25/24 6:39 PM, pan2...@intel.com wrote:

From: Pan Li 

Notice some mis-alignment for gen_kids_1 right hand braces as below:

   if ((_q50 == _q20 && ! TREE_SIDE_EFFECTS (...
 {
   if ((_q51 == _q21 && ! TREE_SIDE_EFFECTS (...
 {
   {
 tree captures[2] ATTRIBUTE_UNUSED = {...
 {
   res_ops[0] = captures[0];
   res_ops[1] = captures[1];
   if (UNLIKELY (debug_dump)) ...
   return true;
 }
   }
 }
 }
}  // mis-aligned here.
  }

The below test are passed for this patch:
* The x86 bootstrap test.
* The x86 fully regression test.

gcc/ChangeLog:

* genmatch.cc (dt_node::gen_kids_1): Fix indent mis-alignment.

OK
Thanks.
jeff



[committed] [v2] More logical op simplifications in simplify-rtx.cc

2024-05-25 Thread Jeff Law

This is a revamp of what started as a target specific patch.

Basically xalan (corrected, I originally thought it was perlbench) has a 
bitset implementation with a bit of an oddity.  Specifically setBit will 
clear the bit before it is set:



if (bitToSet < 32)
{
fBits1 &= ~mask;
fBits1 |= mask;
}
 else
{
fBits2 &= ~mask;
fBits2 |= mask;
}


We can clean this up pretty easily in RTL with a small bit of code in 
simplify-rtx.  While xalan doesn't have other cases, we can synthesize 
tests pretty easily and handle them as well.



It turns out we don't actually have to recognize this stuff at the bit 
level, just standard logical identities are sufficient.  For example


(X | Y) & ~Y -> X & ~Y



Andrew P. might poke at this at the gimple level.  The type changes 
kindof get in the way in gimple but he's much better at match.pd than I 
am, so if he wants to chase it from the gimple side, I'll fully support 
that.


Bootstrapped and regression tested on x86.  Also run through my tester 
on its embedded targets.


Pushing to the trunk.

jeff

commit 05daf617ea22e1d818295ed2d037456937e23530
Author: Jeff Law 
Date:   Sat May 25 12:39:05 2024 -0600

[committed] [v2] More logical op simplifications in simplify-rtx.cc

This is a revamp of what started as a target specific patch.

Basically xalan (corrected, I originally thought it was perlbench) has a 
bitset
implementation with a bit of an oddity.  Specifically setBit will clear the 
bit
before it is set:

> if (bitToSet < 32)
> {
> fBits1 &= ~mask;
> fBits1 |= mask;
> }
>  else
> {
> fBits2 &= ~mask;
> fBits2 |= mask;
> }
We can clean this up pretty easily in RTL with a small bit of code in
simplify-rtx.  While xalan doesn't have other cases, we can synthesize tests
pretty easily and handle them as well.

It turns out we don't actually have to recognize this stuff at the bit 
level,
just standard logical identities are sufficient.  For example

(X | Y) & ~Y -> X & ~Y

Andrew P. might poke at this at the gimple level.  The type changes kindof 
get
in the way in gimple but he's much better at match.pd than I am, so if he 
wants
to chase it from the gimple side, I'll fully support that.

Bootstrapped and regression tested on x86.  Also run through my tester on 
its
embedded targets.

Pushing to the trunk.

gcc/

* simplify-rtx.cc (simplify_context::simplify_binary_operation_1): 
Handle
more logical simplifications.

gcc/testsuite/

* g++.target/riscv/redundant-bitmap-1.C: New test.
* g++.target/riscv/redundant-bitmap-2.C: New test.
* g++.target/riscv/redundant-bitmap-3.C: New test.
* g++.target/riscv/redundant-bitmap-4.C: New test.

diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index 53f54d1d392..5caf1dfd957 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -3549,6 +3549,12 @@ simplify_context::simplify_binary_operation_1 (rtx_code 
code,
return tem;
}
 
+  /* Convert (ior (and (not A) B) A) into A | B.  */
+  if (GET_CODE (op0) == AND
+ && GET_CODE (XEXP (op0, 0)) == NOT
+ && rtx_equal_p (XEXP (XEXP (op0, 0), 0), op1))
+   return simplify_gen_binary (IOR, mode, XEXP (op0, 1), op1);
+
   tem = simplify_byte_swapping_operation (code, mode, op0, op1);
   if (tem)
return tem;
@@ -3801,6 +3807,12 @@ simplify_context::simplify_binary_operation_1 (rtx_code 
code,
return tem;
}
 
+  /* Convert (xor (and (not A) B) A) into A | B.  */
+  if (GET_CODE (op0) == AND
+ && GET_CODE (XEXP (op0, 0)) == NOT
+ && rtx_equal_p (XEXP (XEXP (op0, 0), 0), op1))
+   return simplify_gen_binary (IOR, mode, XEXP (op0, 1), op1);
+
   tem = simplify_byte_swapping_operation (code, mode, op0, op1);
   if (tem)
return tem;
@@ -4006,6 +4018,23 @@ simplify_context::simplify_binary_operation_1 (rtx_code 
code,
  && rtx_equal_p (op1, XEXP (XEXP (op0, 1), 0)))
return simplify_gen_binary (AND, mode, op1, XEXP (op0, 0));
 
+  /* (and (ior/xor (X Y) (not Y)) -> X & ~Y */
+  if ((GET_CODE (op0) == IOR || GET_CODE (op0) == XOR)
+ && GET_CODE (op1) == NOT
+ && rtx_equal_p (XEXP (op1, 0), XEXP (op0, 1)))
+   return simplify_gen_binary (AND, mode, XEXP (op0, 0),
+   simplify_gen_unary (NOT, mode,
+ 

Re: [RFC/RFA] [PATCH 04/12] RISC-V: Add CRC built-ins tests for the target ZBC.

2024-05-25 Thread Jeff Law




On 5/24/24 2:41 AM, Mariam Arutunian wrote:

   gcc/testsuite/gcc.target/riscv/

     * crc-builtin-zbc32.c: New file.
     * crc-builtin-zbc64.c: Likewise.

OK once prerequisites are approved.

jeff



Re: [RFC/RFA] [PATCH 12/12] Add tests for CRC detection and generation.

2024-05-25 Thread Jeff Law




On 5/24/24 2:42 AM, Mariam Arutunian wrote:

   gcc/testsuite/gcc.c-torture/compile/

     * crc-11.c: New test.
     * crc-15.c: Likewise.
     * crc-16.c: Likewise.
     * crc-19.c: Likewise.
     * crc-2.c: Likewise.
     * crc-20.c: Likewise.
     * crc-24.c: Likewise.
     * crc-29.c: Likewise.
     * crc-27.c: Likewise.
     * crc-3.c: Likewise.
     * crc-crc32-data24.c: Likewise.
     * crc-from-fedora-packages (1-24).c: Likewise.
     * crc-linux-(1-5).c: Likewise.
     * crc-not-crc-(1-26).c: Likewise.
     * crc-side-instr-(1-17).c: Likewise.

   gcc/testsuite/gcc.c-torture/execute/

     * crc-(1, 4-10, 12-14, 17-18, 21-28).c: New tests.
     * crc-CCIT-data16-xorOutside_InsideFor.c: Likewise.
     * crc-CCIT-data16.c: Likewise.
     * crc-CCIT-data8.c: Likewise.
     * crc-coremark16-data16.c: Likewise.
     * crc-coremark16-data16.c: Likewise.
     * crc-coremark32-data32.c: Likewise.
     * crc-coremark32-data8.c: Likewise.
     * crc-coremark64-data64.c: Likewise.
     * crc-coremark8-data8.c: Likewise.
     * crc-crc32-data16.c: Likewise.
     * crc-crc32-data8.c: Likewise.
     * crc-crc32.c: Likewise.
     * crc-crc64-data32.c: Likewise.
     * crc-crc64-data64.c: Likewise.
     * crc-crc8-data8-loop-xorInFor.c: Likewise.
     * crc-crc8-data8-loop-xorOutsideFor.c: Likewise.
     * crc-crc8-data8-xorOustideFor.c: Likewise.
     * crc-crc8.c: Likewise.

Signed-off-by: Mariam Arutunian

OK once all prerequisites are approved.

jeff



Re: [RFC/RFA] [PATCH 03/12] RISC-V: Add CRC expander to generate faster CRC.

2024-05-25 Thread Jeff Law




On 5/24/24 2:41 AM, Mariam Arutunian wrote:
If the target is ZBC or ZBKC, it uses clmul instruction for the CRC 
calculation.

Otherwise, if the target is ZBKB, generates table-based CRC,
but for reversing inputs and the output uses bswap and brev8 instructions.
Add new tests to check CRC generation for ZBC, ZBKC and ZBKB targets.

   gcc/

      * expr.cc (gf2n_poly_long_div_quotient): New function.
      (reflect): Likewise.
      * expr.h (gf2n_poly_long_div_quotient): New function declaration.
      (reflect): Likewise.

   gcc/config/riscv/

      * bitmanip.md (crc_rev4): New expander for 
reversed CRC.

      (crc4): New expander for bit-forward CRC.
      (SUBX1, ANYI1): New iterators.
      * riscv-protos.h (generate_reflecting_code_using_brev): New 
function declaration.

      (expand_crc_using_clmul): Likewise.
      (expand_reversed_crc_using_clmul): Likewise.
      * riscv.cc (generate_reflecting_code_using_brev): New function.
      (expand_crc_using_clmul): Likewise.
      (expand_reversed_crc_using_clmul): Likewise.
      * riscv.md (UNSPEC_CRC, UNSPEC_CRC_REV):  New unspecs.

   gcc/testsuite/gcc.target/riscv/

         * crc-1-zbc.c: New test.
         * crc-10-zbc.c: Likewise.
         * crc-12-zbc.c: Likewise.
         * crc-13-zbc.c: Likewise.
         * crc-14-zbc.c: Likewise.
         * crc-17-zbc.c: Likewise.
         * crc-18-zbc.c: Likewise.
         * crc-21-zbc.c: Likewise.
         * crc-22-rv64-zbc.c: Likewise.
         * crc-22-zbkb.c: Likewise.
         * crc-23-zbc.c: Likewise.
         * crc-4-zbc.c: Likewise.
         * crc-5-zbc.c: Likewise.
         * crc-5-zbkb.c: Likewise.
         * crc-6-zbc.c: Likewise.
         * crc-7-zbc.c: Likewise.
         * crc-8-zbc.c: Likewise.
         * crc-8-zbkb.c: Likewise.
         * crc-9-zbc.c: Likewise.
         * crc-CCIT-data16-zbc.c: Likewise.
         * crc-CCIT-data8-zbc.c: Likewise.
         * crc-coremark-16bitdata-zbc.c: Likewise.

Signed-off-by: Mariam Arutunian

diff --git a/gcc/config/riscv/bitmanip.md b/gcc/config/riscv/bitmanip.md
index 8769a6b818b..c98d451f404 100644
--- a/gcc/config/riscv/bitmanip.md
+++ b/gcc/config/riscv/bitmanip.md
@@ -973,3 +973,66 @@
   "TARGET_ZBC"
   "clmulr\t%0,%1,%2"
   [(set_attr "type" "clmul")])
+
+
+;; Iterator for hardware integer modes narrower than XLEN, same as SUBX
+(define_mode_iterator SUBX1 [QI HI (SI "TARGET_64BIT")])
+
+;; Iterator for hardware integer modes narrower than XLEN, same as ANYI
+(define_mode_iterator ANYI1 [QI HI SI (DI "TARGET_64BIT")])
If these iterators are the same as existing ones, let's just use the 
existing ones, unless we need both SUBX and SUBX1 in the same pattern or 
ANYI/ANYI1.





+
+;; Reversed CRC 8, 16, 32 for TARGET_64
+(define_expand "crc_rev4"
+   ;; return value (calculated CRC)
+  [(set (match_operand:ANYI 0 "register_operand" "=r")
+ ;; initial CRC
+   (unspec:ANYI [(match_operand:ANYI 1 "register_operand" "r")
+ ;; data
+ (match_operand:ANYI1 2 "register_operand" "r")
+ ;; polynomial without leading 1
+ (match_operand:ANYI 3)]
+ UNSPEC_CRC_REV))]
So the preferred formatting for .md files has operands of a given 
operator at the same indentation level.  So in this case SET is the 
operator, with two operands (destination/source).  Indent the source and 
destination at the same level, like so:


  [(set (match_operand:ANYI 0 ...0)
(unspec: ANYI ...)

Similarly for the reversed expander.



diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index 85df5b7ab49..123695033a6 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -11394,7 +11394,7 @@ riscv_expand_usadd (rtx dest, rtx x, rtx y)
   if (mode == HImode || mode == QImode)
 {
   int shift_bits = GET_MODE_BITSIZE (Xmode)
-   - GET_MODE_BITSIZE (mode).to_constant ();
+  - GET_MODE_BITSIZE (mode).to_constant ();
 
   gcc_assert (shift_bits > 0);

Looks like an unrelated spurious change.  Drop.


 
@@ -11415,6 +11415,188 @@ riscv_expand_usadd (rtx dest, rtx x, rtx y)

   emit_move_insn (dest, gen_lowpart (mode, xmode_dest));
 }
 
+/* Generate instruction sequence

+   which reflects the value of the OP using bswap and brev8 instructions.
+   OP's mode may be less than word_mode, to get the correct number,
+   after reflecting we shift right the value by SHIFT_VAL.
+   E.g. we have  0001, after reflection (target 32-bit) we will get
+   1000   , if we shift-out 16 bits,
+   we will get the desired one: 1000 .  */
+
+void
+generate_reflecting_code_using_brev (rtx *op, int shift_val)
+{
+
+  riscv_expand_op (BSWAP, word_mode, *op, *op, *op);
+  riscv_expand_op (LSHIFTRT, word_mode, *op, *op,
+  gen_int_mode (shift_val, word_mode));
Formatting nit with the gen_int_mode (...) argument.  It should line up 
with the 

Re: [RFC/RFA] [PATCH 02/12] Add built-ins and tests for bit-forward and bit-reversed CRCs

2024-05-25 Thread Jeff Law




On 5/24/24 2:41 AM, Mariam Arutunian wrote:
This patch introduces new built-in functions to GCC for computing 
bit-forward and bit-reversed CRCs.

These builtins aim to provide efficient CRC calculation capabilities.
When the target architecture supports CRC operations (as indicated by 
the presence of a CRC optab),

the builtins will utilize the expander to generate CRC code.
In the absence of hardware support, the builtins default to generating 
code for a table-based CRC calculation.


The builtins are defined as follows:
__builtin_rev_crc16_data8,
__builtin_rev_crc32_data8, __builtin_rev_crc32_data16, 
__builtin_rev_crc32_data32

__builtin_crc8_data8,
__builtin_crc16_data16, __builtin_crc16_data8,
__builtin_crc32_data8, __builtin_crc32_data16, __builtin_crc32_data32,
__builtin_crc64_data8, __builtin_crc64_data16,  __builtin_crc64_data32, 
__builtin_crc64_data64


Each builtin takes three parameters:
crc: The initial CRC value.
data: The data to be processed.
polynomial: The CRC polynomial without the leading 1.

To validate the correctness of these builtins, this patch also includes 
additions to the GCC testsuite.
This enhancement allows GCC to offer developers high-performance CRC 
computation options

that automatically adapt to the capabilities of the target hardware.

Co-authored-by: Joern Rennecke


Not complete. May continue the work if these built-ins are needed.

gcc/

  * builtin-types.def (BT_FN_UINT8_UINT8_UINT8_CONST_SIZE): Define.
  (BT_FN_UINT16_UINT16_UINT8_CONST_SIZE): Likewise.
           (BT_FN_UINT16_UINT16_UINT16_CONST_SIZE): Likewise.
           (BT_FN_UINT32_UINT32_UINT8_CONST_SIZE): Likewise.
           (BT_FN_UINT32_UINT32_UINT16_CONST_SIZE): Likewise.
           (BT_FN_UINT32_UINT32_UINT32_CONST_SIZE): Likewise.
           (BT_FN_UINT64_UINT64_UINT8_CONST_SIZE): Likewise.
           (BT_FN_UINT64_UINT64_UINT16_CONST_SIZE): Likewise.
           (BT_FN_UINT64_UINT64_UINT32_CONST_SIZE): Likewise.
           (BT_FN_UINT64_UINT64_UINT64_CONST_SIZE): Likewise.
           * builtins.cc (associated_internal_fn): Handle 
BUILT_IN_CRC8_DATA8,

           BUILT_IN_CRC16_DATA8, BUILT_IN_CRC16_DATA16,
           BUILT_IN_CRC32_DATA8, BUILT_IN_CRC32_DATA16, 
BUILT_IN_CRC32_DATA32,
           BUILT_IN_CRC64_DATA8, BUILT_IN_CRC64_DATA16, 
BUILT_IN_CRC64_DATA32,

           BUILT_IN_CRC64_DATA64,
           BUILT_IN_REV_CRC8_DATA8,
           BUILT_IN_REV_CRC16_DATA8, BUILT_IN_REV_CRC16_DATA16,
           BUILT_IN_REV_CRC32_DATA8, BUILT_IN_REV_CRC32_DATA16, 
BUILT_IN_REV_CRC32_DATA32.

           (expand_builtin_crc_table_based): New function.
           (expand_builtin): Handle BUILT_IN_CRC8_DATA8,
           BUILT_IN_CRC16_DATA8, BUILT_IN_CRC16_DATA16,
           BUILT_IN_CRC32_DATA8, BUILT_IN_CRC32_DATA16, 
BUILT_IN_CRC32_DATA32,
           BUILT_IN_CRC64_DATA8, BUILT_IN_CRC64_DATA16, 
BUILT_IN_CRC64_DATA32,

           BUILT_IN_CRC64_DATA64,
           BUILT_IN_REV_CRC8_DATA8,
           BUILT_IN_REV_CRC16_DATA8, BUILT_IN_REV_CRC16_DATA16,
           BUILT_IN_REV_CRC32_DATA8, BUILT_IN_REV_CRC32_DATA16, 
BUILT_IN_REV_CRC32_DATA32.

           * builtins.def (BUILT_IN_CRC8_DATA8): New builtin.
           (BUILT_IN_CRC16_DATA8): Likewise.
           (BUILT_IN_CRC16_DATA16): Likewise.
           (BUILT_IN_CRC32_DATA8): Likewise.
           (BUILT_IN_CRC32_DATA16): Likewise.
           (BUILT_IN_CRC32_DATA32): Likewise.
           (BUILT_IN_CRC64_DATA8): Likewise.
           (BUILT_IN_CRC64_DATA16): Likewise.
           (BUILT_IN_CRC64_DATA32): Likewise.
           (BUILT_IN_CRC64_DATA64): Likewise.
           (BUILT_IN_REV_CRC8_DATA8): New builtin.
           (BUILT_IN_REV_CRC16_DATA8): Likewise.
           (BUILT_IN_REV_CRC16_DATA16): Likewise.
           (BUILT_IN_REV_CRC32_DATA8): Likewise.
           (BUILT_IN_REV_CRC32_DATA16): Likewise.
           (BUILT_IN_REV_CRC32_DATA32): Likewise.
           * builtins.h (expand_builtin_crc_table_based): New function 
declaration.

           * doc/extend.texi (__builtin_rev_crc16_data8,
           (__builtin_rev_crc32_data32, __builtin_rev_crc32_data8,
           __builtin_rev_crc32_data16, __builtin_crc8_data8,
           __builtin_crc16_data16, __builtin_crc16_data8,
           __builtin_crc32_data32, __builtin_crc32_data8,
           __builtin_crc32_data16, __builtin_crc64_data64,
           __builtin_crc64_data8, __builtin_crc64_data16,
           __builtin_crc64_data32): Document.

       gcc/testsuite/

          * gcc.c-torture/compile/crc-builtin-rev-target32.c
          * gcc.c-torture/compile/crc-builtin-rev-target64.c
          * gcc.c-torture/compile/crc-builtin-target32.c
          * gcc.c-torture/compile/crc-builtin-target64.c

Signed-off-by: Mariam Arutunian



diff --git a/gcc/builtins.cc b/gcc/builtins.cc
index f8d94c4b435..b662de91e49 100644
--- a/gcc/builtins.cc
+++ b/gcc/builtins.cc
@@ -2207,7 +2207,24 @@ associated_internal_fn (built_in_function 

Re: [RFC/RFA] [PATCH 01/12] Implement internal functions for efficient CRC computation

2024-05-25 Thread Jeff Law




On 5/24/24 2:41 AM, Mariam Arutunian wrote:
Add two new internal functions (IFN_CRC, IFN_CRC_REV), to provide faster 
CRC generation.

One performs bit-forward and the other bit-reversed CRC computation.
If CRC optabs are supported, they are used for the CRC computation.
Otherwise, table-based CRC is generated.
The supported data and CRC sizes are 8, 16, 32, and 64 bits.
The polynomial is without the leading 1.
A table with 256 elements is used to store precomputed CRCs.
For the reflection of inputs and the output, a simple algorithm involving
SHIFT, AND, and OR operations is used.

Co-authored-by: Joern Rennecke


gcc/

    * doc/md.texi (crc@var{m}@var{n}4,
    crc_rev@var{m}@var{n}4): Document.
    * expr.cc (generate_crc_table): New function.
    (calculate_table_based_CRC): Likewise.
    (expand_crc_table_based): Likewise.
    (gen_common_operation_to_reflect): Likewise.
    (reflect_64_bit_value): Likewise.
    (reflect_32_bit_value): Likewise.
    (reflect_16_bit_value): Likewise.
    (reflect_8_bit_value): Likewise.
    (generate_reflecting_code_standard): Likewise.
    (expand_reversed_crc_table_based): Likewise.
    * expr.h (generate_reflecting_code_standard): New function declaration.
    (expand_crc_table_based): Likewise.
    (expand_reversed_crc_table_based): Likewise.
    * internal-fn.cc: (crc_direct): Define.
    (direct_crc_optab_supported_p): Likewise.
    (expand_crc_optab_fn): New function
    * internal-fn.def (CRC, CRC_REV): New internal functions.
    * optabs.def (crc_optab, crc_rev_optab): New optabs.

Signed-off-by: Mariam Arutunian

diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 5730bda80dc..be68ef860f9 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -8557,6 +8557,20 @@ operand 2, greater than operand 2 or is unordered with 
operand 2.
 
 This pattern is not allowed to @code{FAIL}.
 
+@cindex @code{crc@var{m}@var{n}4} instruction pattern

+@item @samp{crc@var{m}@var{n}4}
+Calculate a bit-forward CRC using operands 1, 2 and 3,
+then store the result in operand 0.
+Operands 1 is the initial CRC, operands 2 is the data and operands 3 is the
+polynomial without leading 1.
+Operands 0, 1 and 3 have mode @var{n} and operand 2 has mode @var{m}, where
+both modes are integers.  The size of CRC to be calculated is determined by the
+mode; for example, if @var{n} is 'hi', a CRC16 is calculated.
+
+@cindex @code{crc_rev@var{m}@var{n}4} instruction pattern
+@item @samp{crc_rev@var{m}@var{n}4}
+Similar to @samp{crc@var{m}@var{n}4}, but calculates a bit-reversed CRC.
+
So just to be clear, this is a case where the input (operand 2) may have 
a different mode than the output (operand 0).  That scenario is 
generally discouraged, with a few exceptions (the most common being 
shift counts which are often QImode objects while the 
value-to-be-shifted and the output value are potentially any scalar 
integer mode).


So I don't think this is a problem, just wanted to point it out to 
anyone else that may be looking at this code.




 @end table
 
 @end ifset

diff --git a/gcc/expr.cc b/gcc/expr.cc
index 1baa39b98eb..18368ae6b6c 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -14091,3 +14091,359 @@ int_expr_size (const_tree exp)
 
   return tree_to_shwi (size);

 }
+
+/* Calculate CRC for the initial CRC and given POLYNOMIAL.
+   CRC_BITS is CRC size.  */
+
+static unsigned HOST_WIDE_INT
+calculate_crc (unsigned HOST_WIDE_INT crc,
+ unsigned HOST_WIDE_INT polynomial,
+ unsigned crc_bits)
Just a nit.  Line up the polynomial & crc_bits declarations with the crc 
declaration.




+{
+  crc = crc << (crc_bits - 8);
+  for (int i = 8; i > 0; --i)
+{
+  if ((crc >> (crc_bits - 1)) & 1)
+   crc = (crc << 1) ^ polynomial;
+  else
+   crc <<= 1;
+}
+
+  crc <<=  (sizeof (crc) * BITS_PER_UNIT - crc_bits);
+  crc >>=  (sizeof (crc) * BITS_PER_UNIT - crc_bits);

Another nit.  Just once space after the <<= or >>= operators.



+
+  return crc;
+}
+
+/* Assemble CRC table with 256 elements for the given POLYNOM and CRC_BITS with
+   given ID.
+   ID is the identifier of the table, the name of the table is unique,
+   contains CRC size and the polynomial.
+   POLYNOM is the polynomial used to calculate the CRC table's elements.
+   CRC_BITS is the size of CRC, may be 8, 16, ... . */
+
+rtx
+assemble_crc_table (tree id, unsigned HOST_WIDE_INT polynom, unsigned crc_bits)
+{
+  unsigned table_el_n = 0x100;
+  tree ar = build_array_type (make_unsigned_type (crc_bits),
+ build_index_type (size_int (table_el_n - 1)));

Nit.  Line up build_index_type at the same indentation as make_unsigned_type.

Note that with TREE_READONLY set, there is at least some chance that the 
linker will find identical tables and merge them.  I haven't tested 
this, but I know it happens for other objects in the constant pools.




+  sprintf (buf, "crc_table_for_crc_%u_polynomial_" 

Re: [RFC/RFA][PATCH 00/12] CRC optimization

2024-05-24 Thread Jeff Law




On 5/24/24 2:41 AM, Mariam Arutunian wrote:

Hello!

This patch set detects bitwise CRC implementation loops (with branches) 
in the GIMPLE optimizers and replaces them with more optimal CRC 
implementations in RTL. These patches introduce new internal functions, 
built-in functions, and expanders for CRC generation, leveraging 
hardware instructions where available. Additionally, various tests are 
included to check CRC detection and generation.
Thanks so much for getting this process started.  It's a bit quicker 
than I was ready for, but no worries.





 2.

Architecture-Specific Expanders:

  * Expanders are added for RISC-V, aarch64, and i386 architectures.
  * These expanders generate CRCs using either carry-less
multiplication instructions or direct CRC instructions, based on
the target architecture's capabilities.
Also note for the wider audience, this work can also generate table 
lookup based CRC implementations.  This has proven exceedingly helpful 
during the testing phase as we were able to run this code on a wide 
variety of the embedded targets to shake out target dependencies.


On Ventana's V1 design the clmul variant was a small, but clear winner 
over the table lookup.  Obviously the bitwise implementation found in 
coremark was the worst performing.


On our V2 design clmul outperforms the table lookup by a wide margin, 
largely due to reduced latency of clmul.



Jeff


Re: [PATCH v2] RISC-V: Introduce -mrvv-allow-misalign.

2024-05-24 Thread Jeff Law




On 5/24/24 5:43 PM, Palmer Dabbelt wrote:



I'm only reading Zicclsm as saying both scalar and vector misaligned
accesses are supported, but nothing about the performance.

I think it was in the vector docs.  It didn't say anything about
performance, just a note that scalar & vector behavior could differ.


Either way, the split naming scheme seems clearer to me.  It also avoids 
getting mixed up by the no-scalar-misaligned, yes-vector-misaligned 
systems if they ever show up.


So if Robin's OK with re-spinning things, let's just go that way?
Works for me.  Hopefully he's offline until Monday as it's rather late 
for him :-)  So we'll pick it back up in the Tuesday meeting.


jeff



Re: [PATCH v2] RISC-V: Introduce -mrvv-allow-misalign.

2024-05-24 Thread Jeff Law




On 5/24/24 5:39 PM, Palmer Dabbelt wrote:

On Fri, 24 May 2024 16:31:48 PDT (-0700), jeffreya...@gmail.com wrote:



On 5/24/24 11:14 AM, Palmer Dabbelt wrote:

On Fri, 24 May 2024 09:19:09 PDT (-0700), Robin Dapp wrote:

We should have something in doc/invoke too, this one is going to be
tricky for users.  We'll also have to define how this interacts with
the existing -mstrict-align.


Addressed the rest in the attached v2 which also fixes tests.
I'm really not sure about -mstrict-align.  I would have hoped that 
using

-mstrict-align we'd never run into any movmisalign situation but that
might be wishful thinking.  Do we need to specify an
interaction, though?  For now the new options disables movmisalign so
if we hit that despite -mstrict-align we'd still not vectorize it.


I think we just need to write it down.  I think there's two ways to
encode this: either we treat scalar and vector as independent, or we
couple them.  If we treat them independently then we end up with four
cases, it's not clear if they're all interesting.  IIUC with this patch
we'd be able to encode

Given the ISA documents them as independent, I think we should follow
suit and allow them to vary independently.


I'm only reading Zicclsm as saying both scalar and vector misaligned 
accesses are supported, but nothing about the performance.
I think it was in the vector docs.  It didn't say anything about 
performance, just a note that scalar & vector behavior could differ.






Seems reasonable to me.  Just having a regular naming scheme for the 
scalar/vector makes it clear what we're doing, and it's not like having 
the extra name for -mscalar-strict-align really costs anything.

That was my thinking -- get the names right should help avoid confusion.

Jeff


Re: [PATCH] RISC-V: Avoid splitting store dataref groups during SLP discovery

2024-05-24 Thread Jeff Law




On 5/23/24 11:52 PM, Richard Biener wrote:



This worked out so I pushed the change.  The gcc.dg/vect/pr97428.c
test is FAILing on RISC-V (it still gets 0 SLP), because of missed
load permutations.  I hope the followup reorg for the load side will
fix this.  It also FAILs gcc.target/riscv/rvv/autovec/struct/struct_vect-4.c
which does excessive assembly scanning on many functions - I'll leave
this for target maintainers to update - there's one or two functions
which we now expect to SLP.
Yea, folks got a bit carried away with the scan body capability. 
Someone will have to follow up behind you and clean this up a bit.


Thanks for checking it against the CI system.  While it's a bit on the 
slow side, we are finding it's helping catch real issues and keeping the 
testsuite cleaner WRT FAILs.


jeff



Re: [PATCH v2] RISC-V: Introduce -mrvv-allow-misalign.

2024-05-24 Thread Jeff Law




On 5/24/24 11:14 AM, Palmer Dabbelt wrote:

On Fri, 24 May 2024 09:19:09 PDT (-0700), Robin Dapp wrote:

We should have something in doc/invoke too, this one is going to be
tricky for users.  We'll also have to define how this interacts with
the existing -mstrict-align.


Addressed the rest in the attached v2 which also fixes tests.
I'm really not sure about -mstrict-align.  I would have hoped that using
-mstrict-align we'd never run into any movmisalign situation but that
might be wishful thinking.  Do we need to specify an
interaction, though?  For now the new options disables movmisalign so
if we hit that despite -mstrict-align we'd still not vectorize it.


I think we just need to write it down.  I think there's two ways to 
encode this: either we treat scalar and vector as independent, or we 
couple them.  If we treat them independently then we end up with four 
cases, it's not clear if they're all interesting.  IIUC with this patch 
we'd be able to encode
Given the ISA documents them as independent, I think we should follow 
suit and allow them to vary independently.




* -mstrict-align: Both scalar and vector misaligned accesses are 
  unsupported (-mrvv-allow-misalign doesn't matter).  I'm not sure if 
  there's hardware there, but given we have systems that don't support 
  scalar misaligned accesses it seems reasonable to assume they'll also 
  not support vector misaligned accesses.
* -mno-strict-align -mno-rvv-allow-misalign: Scalar misaligned are 
  supported, vector misaligned aren't supported.  This matches our best 
  theory of how the k230 and k1 behave, so it also seems reasonable to 
  support.
* -mno-strict-align -mrvv-allow-misalign: Both scalar and vector 
  misaligned accesses are supported.  This seems reasonable to support 
  as it's how I'd hope big cores end up being designed, though again 
  there's no hardware.
I'd almost lean towards -m[no-]scalar-strict-align and 
-m[no-]vector-strict-align and deprecate -mstrict-align (aliasing it to 
the scalar alignment option).  But I'll go with consensus here.




The fourth case is kind of wacky: scalar misaligned is unsupported, 
vector misaligned is supported.  I'm not really sure why we'd end up 
with a system like that, but HW vendors do wacky things so it's kind of 
hard to predict.
I've worked on one of these :-)  The thinking from the designers was 
unaligned scalar access just wasn't that important, particularly with 
mem* and str* using the vector rather than scalar ops.


jeff





[to-be-committed][v2][RISC-V] Use bclri in constant synthesis

2024-05-23 Thread Jeff Law
Testing with Zbs enabled by default showed a minor logic error.  After 
the loop clearing things with bclri, we can only use the sequence if we 
were able to clear all the necessary bits.  If any bits are still on, 
then the bclr sequence turned out to not be profitable.


--

So this is conceptually similar to how we handled direct generation of
bseti for constant synthesis, but this time for bclr.

In the bclr case, we already have an expander for AND.  So we just
needed to adjust the predicate to accept another class of constant
operands (those with a single bit clear).

With that in place constant synthesis is adjusted so that it counts the
number of bits clear in the high 33 bits of a 64bit word.  If that
number is small relative to the current best cost, then we try to
generate the constant with a lui based sequence for the low half which
implicitly sets the upper 32 bits as well.  Then we bclri one or more of
those upper 33 bits.
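The counting step described above can be sketched as a toy Python model (this
is an illustration, not the GCC code; the function name is hypothetical):

```python
def upper_clear_count(value):
    """Count the zero bits in the 33-bit window (bits 31..63) of a
    64-bit constant.  This is the quantity the synthesis heuristic
    weighs against the current best cost: once the lui-based sequence
    has turned the whole window on via sign extension, one bclri is
    needed for each zero bit in the window."""
    window = (value >> 31) & ((1 << 33) - 1)
    return 33 - bin(window).count("1")
```

For example, a 64-bit constant with only bits 31 and 42 clear would need two
bclri instructions on top of the lui-based sequence.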

So as an example, this code goes from 4 instructions down to 3:

> unsigned long foo_0xfffbf7ff(void) { return 0xfffbf7ffUL; }




Note the use of 33 bits above.  That's meant to capture cases like this:


> unsigned long foo_0xfffd77ff(void) { return 0xfffd77ffUL; }




We can use lui+addi+bclri+bclri to synthesize that in 4 instructions
instead of 5.
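The overall idea can be modeled end to end with a small sketch (again a toy
model under the assumption that the lui/addi step is a single pseudo-op;
names are hypothetical and this is not the GCC implementation):

```python
def synthesize_with_bclri(value):
    """Model of bclri-based synthesis: force bits 31..63 on so that a
    lui/addi sequence (whose sign extension sets the upper bits for
    free) can build the base constant, then clear each unwanted upper
    bit with one bclri."""
    MASK = 0xFFFFFFFF80000000            # bits 31..63 of a 64-bit word
    M64 = (1 << 64) - 1
    ops = []
    base = (value | MASK) & M64
    ops.append(("lui+addi", base))       # stands in for the low-half sequence
    clear = ~value & MASK & M64          # upper bits that must end up zero
    while clear:
        bit = (clear & -clear).bit_length() - 1   # lowest set bit
        ops.append(("bclri", bit))
        base &= ~(1 << bit) & M64
        clear &= clear - 1
    assert base == value & M64
    return ops
```

Note that bit 31 itself sits inside the window, which is what makes the
33-bit cases above synthesizable this way.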




I'm including a handful of cases covering the two basic ideas above that
were found by the testing code.

And, no, we're not done yet.  I see at least one more notable idiom
missing before exploring zbkb's potential to improve things.

Tested in my tester and waiting on Rivos CI system before moving forward.
gcc/

* config/riscv/predicates.md (arith_operand_or_mode_mask): Renamed to...
(arith_or_mode_mask_or_zbs_operand): New predicate.
* config/riscv/riscv.md (and<mode>3): Update predicate for operand 2.
* config/riscv/riscv.cc (riscv_build_integer_1): Use bclri to clear
bits, particularly bits 31..63 when profitable to do so.

gcc/testsuite/

* gcc.target/riscv/synthesis-6.c: New test.

diff --git a/gcc/config/riscv/predicates.md b/gcc/config/riscv/predicates.md
index 8948fbfc363..c1c693c7617 100644
--- a/gcc/config/riscv/predicates.md
+++ b/gcc/config/riscv/predicates.md
@@ -27,12 +27,6 @@ (define_predicate "arith_operand"
   (ior (match_operand 0 "const_arith_operand")
(match_operand 0 "register_operand")))
 
-(define_predicate "arith_operand_or_mode_mask"
-  (ior (match_operand 0 "arith_operand")
-   (and (match_code "const_int")
-(match_test "UINTVAL (op) == GET_MODE_MASK (HImode)
-|| UINTVAL (op) == GET_MODE_MASK (SImode)"))))
-
 (define_predicate "lui_operand"
   (and (match_code "const_int")
(match_test "LUI_OPERAND (INTVAL (op))")))
@@ -398,6 +392,14 @@ (define_predicate "not_single_bit_mask_operand"
   (and (match_code "const_int")
(match_test "SINGLE_BIT_MASK_OPERAND (~UINTVAL (op))")))
 
+(define_predicate "arith_or_mode_mask_or_zbs_operand"
+  (ior (match_operand 0 "arith_operand")
+   (and (match_test "TARGET_ZBS")
+   (match_operand 0 "not_single_bit_mask_operand"))
+   (and (match_code "const_int")
+(match_test "UINTVAL (op) == GET_MODE_MASK (HImode)
+|| UINTVAL (op) == GET_MODE_MASK (SImode)"))))
+
 (define_predicate "const_si_mask_operand"
   (and (match_code "const_int")
(match_test "(INTVAL (op) & (GET_MODE_BITSIZE (SImode) - 1))
diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index 85df5b7ab49..3b32b515fac 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -893,6 +893,40 @@ riscv_build_integer_1 (struct riscv_integer_op codes[RISCV_MAX_INTEGER_OPS],
  codes[1].use_uw = false;
  cost = 2;
}
+
+  /* If LUI/ADDI are going to set bits 32..63 and we need a small
+number of them cleared, we might be able to use bclri profitably. 
+
+Note we may allow clearing of bit 31 using bclri.  There's a class
+of constants with that bit clear where this helps.  */
+  else if (TARGET_64BIT
+  && TARGET_ZBS
+  && (32 - popcount_hwi (value & HOST_WIDE_INT_C 
(0x8000))) + 1 < cost)
+   {
+ /* Turn on all those upper bits and synthesize the result.  */
+ HOST_WIDE_INT nval = value | HOST_WIDE_INT_C (0x8000);
+ alt_cost = riscv_build_integer_1 (alt_codes, nval, mode);
+
+ /* Now iterate over the bits we want to clear until the cost is
+too high or we're done.  */
+ nval = value ^ HOST_WIDE_INT_C (-1);
+ nval &= HOST_WIDE_INT_C (~0x7fff);
+ while (nval && alt_cost < cost)
+   {
+ HOST_WIDE_INT bit = ctz_hwi (nval);
+ alt_codes[alt_cost].code = AND;
+ alt_codes[alt_cost].value = ~(1UL << bit);
+ alt_codes[alt_cost].use_uw = false;
+ 
