[PATCH] middle-end/111591 - explain why TBAA doesn't need adjustment

2023-12-12 Thread Richard Biener
While tidying the prototype patch I had done for the reduced testcase
in PR111591, and in the process trying to produce a testcase that is
miscompiled by stack slot coalescing with the TBAA info left unaltered,
I realized we do not need to adjust TBAA info at all.

The following documents this in the place we adjust points-to info
which we do need to adjust.

Pushed.  Feel free to poke holes into the argument.

Richard.

PR middle-end/111591
* cfgexpand.cc (update_alias_info_with_stack_vars): Document
why not adjusting TBAA info on accesses is OK.
---
 gcc/cfgexpand.cc | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
index 21ba84ab30b..8f6451e44ff 100644
--- a/gcc/cfgexpand.cc
+++ b/gcc/cfgexpand.cc
@@ -786,7 +786,13 @@ add_partitioned_vars_to_ptset (struct pt_solution *pt,
 /* Update points-to sets based on partition info, so we can use them on RTL.
The bitmaps representing stack partitions will be saved until expand,
where partitioned decls used as bases in memory expressions will be
-   rewritten.  */
+   rewritten.
+
+   It is not necessary to update TBAA info on accesses to the coalesced
+   storage since our memory model doesn't allow TBAA to be used for
+   WAW or WAR dependences.  For RAW when the write is to an old object
+   the new object would not have been initialized at the point of the
+   read, invoking undefined behavior.  */
 
 static void
 update_alias_info_with_stack_vars (void)
-- 
2.35.3
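
As an illustration of the argument (a sketch, not part of the patch;
the names are made up), consider two locals with disjoint lifetimes
that expand may coalesce into one stack slot:

long
f (int i)
{
  long r;
  {
    int a = i;		/* store to the slot with 'int' TBAA */
    r = a;
  }
  {
    long b = i + 1;	/* store to the same slot with 'long' TBAA */
    r += b;		/* any read of 'b' necessarily follows its init */
  }
  return r;
}

A TBAA-driven reordering could only matter for a read of one object
moved across a write to the other, but such a read would inspect an
object before its initialization, which is undefined behavior anyway,
and WAW/WAR reorderings are not performed based on TBAA.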


Re: [PATCH] SRA: Force gimple operand in an additional corner case (PR 112822)

2023-12-12 Thread Richard Biener
On Tue, 12 Dec 2023, Peter Bergner wrote:

> On 12/12/23 8:36 PM, Jason Merrill wrote:
> > This test is failing for me below C++17, I think you need
> > 
> > // { dg-do compile { target c++17 } }
> > or
> > // { dg-require-effective-target c++17 }
> 
> Sorry about that.  Should we do the above or should we just add
> -std=c++17 to dg-options?  ...or do we need to do both?

Just do the above; the C++ testsuite iterates over all standards, so
adding -std=c++17 would just run the same thing 5 times.  The above
properly skips unsupported cases.

Richard.
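
For reference, a minimal sketch of a test header using the suggested
directive (the file body here is illustrative, not the actual testcase):

// { dg-do compile { target c++17 } }

// With the target selector, the testsuite's -std= iterations below
// C++17 report UNSUPPORTED instead of FAIL, and dg-options needs no
// explicit -std=c++17.
template <auto V> constexpr auto value = V;	// C++17 feature
static_assert (value<42> == 42);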


[PATCH] Force broadcast constant to mem for vec_dup{v4di, v8si, v4df, v8df} when TARGET_AVX2 is not available.

2023-12-12 Thread liuhongt
vpbroadcastd/vpbroadcastq is available under TARGET_AVX2, but the
vec_dup{v4di,v8si} pattern is available under AVX with a memory operand.
And it will cause LRA/reload to generate a spill and reload if we put
the constant in a register.
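
A sketch of the scenario being fixed (the intrinsic is illustrative,
not taken from the PR):

#include <immintrin.h>

/* With -mavx (no -mavx2) there is no vpbroadcastd, and vbroadcastss
   only accepts a memory operand, so a register-held constant would be
   spilled and reloaded; forcing the constant to the constant pool lets
   the broadcast load it directly.  */
__m256i
dup_constant (void)
{
  return _mm256_set1_epi32 (0x12345678);
}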

Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.

gcc/ChangeLog:

PR target/112992
* config/i386/i386-expand.cc
(ix86_convert_const_wide_int_to_broadcast): Don't convert to
broadcast for vec_dup{v4di,v8si} when TARGET_AVX2 is not
available.
(ix86_broadcast_from_constant): Allow broadcast for V4DI/V8SI
when !TARGET_AVX2 since it will be forced to memory later.
(ix86_expand_vector_move): Force constant to mem for
vec_dup{v8si,v4di} when TARGET_AVX2 is not available.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr100865-7a.c: Adjust testcase.
* gcc.target/i386/pr100865-7c.c: Ditto.
* gcc.target/i386/pr112992.c: New test.
---
 gcc/config/i386/i386-expand.cc  | 48 +
 gcc/testsuite/gcc.target/i386/pr100865-7a.c |  3 +-
 gcc/testsuite/gcc.target/i386/pr100865-7c.c |  3 +-
 gcc/testsuite/gcc.target/i386/pr112992.c| 30 +
 4 files changed, 62 insertions(+), 22 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr112992.c

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index a53d69d5400..fad4f34f905 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -297,6 +297,12 @@ ix86_convert_const_wide_int_to_broadcast (machine_mode mode, rtx op)
   if (!TARGET_INTER_UNIT_MOVES_TO_VEC)
     return nullptr;
 
+  unsigned int msize = GET_MODE_SIZE (mode);
+
+  /* Only optimized for vpbroadcast[bwsd]/vbroadcastss with xmm/ymm/zmm.  */
+  if (msize != 16 && msize != 32 && msize != 64)
+    return nullptr;
+
   /* Convert CONST_WIDE_INT to a non-standard SSE constant integer
  broadcast only if vector broadcast is available.  */
   if (!TARGET_AVX
@@ -309,18 +315,23 @@ ix86_convert_const_wide_int_to_broadcast (machine_mode mode, rtx op)
   HOST_WIDE_INT val = CONST_WIDE_INT_ELT (op, 0);
   HOST_WIDE_INT val_broadcast;
   scalar_int_mode broadcast_mode;
-  if (TARGET_AVX2
+  /* vpbroadcastb zmm requires TARGET_AVX512BW.  */
+  if ((msize == 64 ? TARGET_AVX512BW : TARGET_AVX2)
   && ix86_broadcast (val, GET_MODE_BITSIZE (QImode),
 val_broadcast))
 broadcast_mode = QImode;
-  else if (TARGET_AVX2
+  else if ((msize == 64 ? TARGET_AVX512BW : TARGET_AVX2)
   && ix86_broadcast (val, GET_MODE_BITSIZE (HImode),
  val_broadcast))
 broadcast_mode = HImode;
-  else if (ix86_broadcast (val, GET_MODE_BITSIZE (SImode),
+  /* vbroadcasts[sd] only support a memory operand w/o AVX2.
+     When msize == 16, pshufs is used for vec_duplicate.
+     When msize == 64, vpbroadcastd is used, and TARGET_AVX512F must
+     be available.  */
+  else if ((msize != 32 || TARGET_AVX2)
+  && ix86_broadcast (val, GET_MODE_BITSIZE (SImode),
   val_broadcast))
 broadcast_mode = SImode;
-  else if (TARGET_64BIT
+  else if (TARGET_64BIT && (msize != 32 || TARGET_AVX2)
   && ix86_broadcast (val, GET_MODE_BITSIZE (DImode),
  val_broadcast))
 broadcast_mode = DImode;
@@ -596,23 +607,17 @@ ix86_broadcast_from_constant (machine_mode mode, rtx op)
   && INTEGRAL_MODE_P (mode))
 return nullptr;
 
+  unsigned int msize = GET_MODE_SIZE (mode);
+  unsigned int inner_size = GET_MODE_SIZE (GET_MODE_INNER ((mode)));
+
   /* Convert CONST_VECTOR to a non-standard SSE constant integer
  broadcast only if vector broadcast is available.  */
-  if (!(TARGET_AVX2
-   || (TARGET_AVX
-   && (GET_MODE_INNER (mode) == SImode
-   || GET_MODE_INNER (mode) == DImode))
-   || FLOAT_MODE_P (mode))
-  || standard_sse_constant_p (op, mode))
+  if (standard_sse_constant_p (op, mode))
 return nullptr;
 
-  /* Don't broadcast from a 64-bit integer constant in 32-bit mode.
- We can still put 64-bit integer constant in memory when
- avx512 embed broadcast is available.  */
-  if (GET_MODE_INNER (mode) == DImode && !TARGET_64BIT
-  && (!TARGET_AVX512F
- || (GET_MODE_SIZE (mode) == 64 && !TARGET_EVEX512)
- || (GET_MODE_SIZE (mode) < 64 && !TARGET_AVX512VL)))
+  /* vpbroadcast[b,w] is available under TARGET_AVX2,
+     or TARGET_AVX512BW for zmm.  */
+  if (inner_size < 4 && !(msize == 64 ? TARGET_AVX512BW : TARGET_AVX2))
 return nullptr;
 
   if (GET_MODE_INNER (mode) == TImode)
@@ -710,7 +715,14 @@ ix86_expand_vector_move (machine_mode mode, rtx operands[])
 constant or scalar mem.  */
  op1 = gen_reg_rtx (mode);
  if (FLOAT_MODE_P (mode)
- || (!TARGET_64BIT && GET_MODE_INNER (mode) == DImode))
+ || (!TARGET_64BIT && GET_MODE_INNER (mode) == DImode)
+ /* 

Re: [PATCH v2] LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT.

2023-12-12 Thread Xi Ruoyao
On Wed, 2023-12-13 at 14:32 +0800, Jiahao Xu wrote:
> 
> > On 2023/12/13 14:21, Xi Ruoyao wrote:
> > On Wed, 2023-12-13 at 14:17 +0800, Jiahao Xu wrote:
> > > This test was extracted from the hot functions of 526.blender_r. Setting
> > > LOGICAL_OP_NON_SHORT_CIRCUIT to 0 resulted in a 26% decrease in dynamic
> > > instruction count and a 13.4% performance improvement. After applying
> > > the patch mentioned above, the assembly code looks much better with
> > > LOGICAL_OP_NON_SHORT_CIRCUIT=1, bringing an 11% improvement to 526.
> > > Based on this, setting LOGICAL_OP_NON_SHORT_CIRCUIT to 0 further
> > > improved the performance of 526 by 3%. The definition of
> > > LOGICAL_OP_NON_SHORT_CIRCUIT determines how gimple is generated, while
> > > the optimizations you made determine how rtl is generated. They are not
> > > conflicting and combining them would yield better results.  Currently, I
> > > have only tested it on 526, and I will continue testing its impact on
> > > the entire SPEC 2017 suite.
> > The problem with LOGICAL_OP_NON_SHORT_CIRCUIT = 0 is it may regress
> > fixed-point only code.  In practice the usage of -ffast-math is very
> > rare ("real" Linux packages invoking floating-point operations often
> > just malfunction with it) and it seems not good to regress common cases
> > with uncommon cases.
> > 
> Setting LOGICAL_OP_NON_SHORT_CIRCUIT to 0 in SPEC2017 intrate benchmark 
> results in a 1.6% decrease in dynamic instruction count and an overall
> performance improvement of 0.5%. Most of the SPEC2017 int programs 
> experience a decrease in instruction count, and there are no instances
> of performance regression observed.

OK then.  But please add this info to the commit message.

-- 
Xi Ruoyao 
School of Aerospace Science and Technology, Xidian University
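
For readers following the thread, a small sketch of what the macro
controls (not taken from the patch):

/* With LOGICAL_OP_NON_SHORT_CIRCUIT = 1, the middle end may produce
     t1 = a < b;  t2 = c < d;  t3 = t1 & t2;  if (t3) ...
   so both compares execute and only one branch remains.  With 0, the
   C short-circuit form is kept:
     if (a < b) if (c < d) ...
   trading an extra branch for possibly skipping the second compare.  */
int
both_less (float a, float b, float c, float d)
{
  return a < b && c < d;
}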


[PATCH] RISC-V: Fix dynamic lmul tests depended on abi

2023-12-12 Thread demin . han
These two tests depend on -mabi.
Other toolchain configs would report:
fatal error: gnu/stubs-ilp32.h: No such file or directory

gcc/testsuite/ChangeLog:

* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul1-7.c: Fix ABI issue.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul2-4.c: Ditto.

Signed-off-by: demin.han 
---
 .../gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul1-7.c | 4 +++-
 .../gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul2-4.c | 4 +++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul1-7.c b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul1-7.c
index 8e6610b0e11..7fd397b782e 100644
--- a/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul1-7.c
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul1-7.c
@@ -1,5 +1,7 @@
 /* { dg-do compile } */
-/* { dg-options "-march=rv32gcv -mabi=ilp32 -O3 -ftree-vectorize --param riscv-autovec-lmul=dynamic -Wno-psabi -fdump-tree-vect-details" } */
+/* { dg-options "-O3 -ftree-vectorize --param riscv-autovec-lmul=dynamic -Wno-psabi -fdump-tree-vect-details" } */
+/* { dg-additional-options "-march=rv32gcv" { target riscv32*-*-* } } */
+/* { dg-additional-options "-march=rv64gcv" { target riscv64*-*-* } } */
 
 #include "riscv_vector.h"
 
diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul2-4.c b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul2-4.c
index b3498ad8210..5fd27cb01e1 100644
--- a/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul2-4.c
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul2-4.c
@@ -1,5 +1,7 @@
 /* { dg-do compile } */
-/* { dg-options "-march=rv32gcv -mabi=ilp32 -O3 -ftree-vectorize --param riscv-autovec-lmul=dynamic -fdump-tree-vect-details" } */
+/* { dg-options "-O3 -ftree-vectorize --param riscv-autovec-lmul=dynamic -fdump-tree-vect-details" } */
+/* { dg-additional-options "-march=rv32gcv" { target riscv32*-*-* } } */
+/* { dg-additional-options "-march=rv64gcv" { target riscv64*-*-* } } */
 
 #include "riscv_vector.h"
 
-- 
2.43.0



Re: [PATCH v2] LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT.

2023-12-12 Thread Jiahao Xu



On 2023/12/13 14:21, Xi Ruoyao wrote:

On Wed, 2023-12-13 at 14:17 +0800, Jiahao Xu wrote:

This test was extracted from the hot functions of 526.blender_r. Setting
LOGICAL_OP_NON_SHORT_CIRCUIT to 0 resulted in a 26% decrease in dynamic
instruction count and a 13.4% performance improvement. After applying
the patch mentioned above, the assembly code looks much better with
LOGICAL_OP_NON_SHORT_CIRCUIT=1, bringing an 11% improvement to 526.
Based on this, setting LOGICAL_OP_NON_SHORT_CIRCUIT to 0 further
improved the performance of 526 by 3%. The definition of
LOGICAL_OP_NON_SHORT_CIRCUIT determines how gimple is generated, while
the optimizations you made determine how rtl is generated. They are not
conflicting and combining them would yield better results.  Currently, I
have only tested it on 526, and I will continue testing its impact on
the entire SPEC 2017 suite.

The problem with LOGICAL_OP_NON_SHORT_CIRCUIT = 0 is it may regress
fixed-point only code.  In practice the usage of -ffast-math is very
rare ("real" Linux packages invoking floating-point operations often
just malfunction with it) and it seems not good to regress common cases
with uncommon cases.

Setting LOGICAL_OP_NON_SHORT_CIRCUIT to 0 in SPEC2017 intrate benchmark 
results in a 1.6% decrease in dynamic instruction count and an overall 
performance improvement of 0.5%. Most of the SPEC2017 int programs 
experience a decrease in instruction count, and there are no instances 
of performance regression observed.




Re: [PATCH v2] LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT.

2023-12-12 Thread Xi Ruoyao
On Wed, 2023-12-13 at 14:17 +0800, Jiahao Xu wrote:
> This test was extracted from the hot functions of 526.blender_r. Setting 
> LOGICAL_OP_NON_SHORT_CIRCUIT to 0 resulted in a 26% decrease in dynamic 
> instruction count and a 13.4% performance improvement. After applying 
> the patch mentioned above, the assembly code looks much better with 
> LOGICAL_OP_NON_SHORT_CIRCUIT=1, bringing an 11% improvement to 526. 
> Based on this, setting LOGICAL_OP_NON_SHORT_CIRCUIT to 0 further 
> improved the performance of 526 by 3%. The definition of 
> LOGICAL_OP_NON_SHORT_CIRCUIT determines how gimple is generated, while
> the optimizations you made determine how rtl is generated. They are not 
> conflicting and combining them would yield better results.  Currently, I 
> have only tested it on 526, and I will continue testing its impact on 
> the entire SPEC 2017 suite.

The problem with LOGICAL_OP_NON_SHORT_CIRCUIT = 0 is it may regress
fixed-point only code.  In practice the usage of -ffast-math is very
rare ("real" Linux packages invoking floating-point operations often
just malfunction with it) and it seems not good to regress common cases
with uncommon cases.

-- 
Xi Ruoyao 
School of Aerospace Science and Technology, Xidian University


Re: [gcc-wwwdocs PATCH] gcc-13/14: Mention recent update for x86_64 backend

2023-12-12 Thread Gerald Pfeifer
On Fri, 8 Dec 2023, Haochen Jiang wrote:
> +++ b/htdocs/gcc-13/changes.html

> +Based on ISA extensions enabled on Alder Lake, the switch further enables
> +the AVX-IFMA, AVX-VNNI-INT8, AVX-NE-CONVERT, CMPccXADD, ENQCMD and UINTR
> +ISA extensions.

Personally I would alphabetically sort all the options, like you have 
mostly done already. Just AVX-VNNI-INT8 and AVX-NE-CONVERT are not.

(Maybe you have a reason, and in any case this should not block this 
patch.)


> +++ b/htdocs/gcc-14/changes.html
> +  New ISA extension support for Intel AVX10.1 was added.
> +  AVX10.1 intrinsics are available via the -mavx10.1 or
> +  -mavx10.1-256 compiler switch with 256 bit vector size
> +  support. 512 bit vector size support for AVX10.1 intrinsics are

We usually write 256-bit and 512-bit as adjectives, cf. 
gcc.gnu.org/codingconventions.html .

> +  Part of new feature support for Intel APX was added, including EGPR,
> +  PUSH2POP2, PPX and NDD. 

Alphabetically?

> APX features are available via the
> +  -mapxf compiler switch.

Could we say "APX is enabled via..." or "APX support is available via..."?

> +  Xeon Phi CPUs support (a.k.a. Knight Landing and Knight Mill) are 
> marked
> +as deprecated. GCC will emit a warning when using the
> +-mavx5124fmaps, -mavx5124vnniw,
> +-mavx512er, -mavx512pf,
> +-mprefetchwt1, -march=knl,
> +-march=knm, -mtune=knl and 
> -mtune=knm
> +compiler switch. The support will be removed in GCC 15.
> +  

I believe "or" instead of "and" will be clearer.

And "compiler switches" (plural).

And just "Support" in the last sentence.


Thanks for submitting these! No need for further review before committing
(with the minor variations above).

Gerald


Re: [PATCH v2] LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT.

2023-12-12 Thread Jiahao Xu



On 2023/12/13 02:27, Xi Ruoyao wrote:

On Tue, 2023-12-12 at 20:39 +0800, Xi Ruoyao wrote:

On Tue, 2023-12-12 at 19:59 +0800, Jiahao Xu wrote:

I guess here the problem is that the floating-point compare instruction
is much more costly than other instructions but the fact is not
correctly modeled yet.  Could you try
https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640012.html
where I've raised the fp_add cost (which is used for estimating
floating-point compare cost) to 5 instructions and see if it solves
your problem without LOGICAL_OP_NON_SHORT_CIRCUIT?

I think this is not the same issue as the cost of floating-point
comparison instructions. The definition of LOGICAL_OP_NON_SHORT_CIRCUIT
affects how the short-circuit branch, such as (A AND-IF B), is executed,
and it is not directly related to the cost of floating-point comparison
instructions. I will try to test it using SPECCPU 2017.

The point is that if the cost of floating-point comparison is very high,
the middle end *should* short-circuit floating-point comparisons even if
LOGICAL_OP_NON_SHORT_CIRCUIT = 1.

I've created https://gcc.gnu.org/PR112985.

Another factor regressing the code is that we haven't modeled the
movcf2gr instruction yet, so we are not really eliding the branches as
LOGICAL_OP_NON_SHORT_CIRCUIT = 1 is supposed to do.

I made up this:

diff --git a/gcc/config/loongarch/loongarch.md b/gcc/config/loongarch/loongarch.md
index a5d0dcd65fe..84d828ebd0f 100644
--- a/gcc/config/loongarch/loongarch.md
+++ b/gcc/config/loongarch/loongarch.md
@@ -3169,6 +3169,42 @@ (define_insn "s__using_FCCmode"
[(set_attr "type" "fcmp")
 (set_attr "mode" "FCC")])
  
+(define_insn "movcf2gr"

+  [(set (match_operand:GPR 0 "register_operand" "=r")
+   (if_then_else:GPR (ne (match_operand:FCC 1 "register_operand" "z")
+ (const_int 0))
+ (const_int 1)
+ (const_int 0)))]
+  "TARGET_HARD_FLOAT"
+  "movcf2gr\t%0,%1"
+  [(set_attr "type" "move")
+   (set_attr "mode" "FCC")])
+
+(define_expand "cstore4"
+  [(set (match_operand:SI 0 "register_operand")
+   (match_operator:SI 1 "loongarch_fcmp_operator"
+ [(match_operand:ANYF 2 "register_operand")
+  (match_operand:ANYF 3 "register_operand")]))]
+  ""
+  {
+rtx fcc = gen_reg_rtx (FCCmode);
+rtx cmp = gen_rtx_fmt_ee (GET_CODE (operands[1]), FCCmode,
+ operands[2], operands[3]);
+
+emit_insn (gen_rtx_SET (fcc, cmp));
+if (TARGET_64BIT)
+  {
+   rtx gpr = gen_reg_rtx (DImode);
+   emit_insn (gen_movcf2grdi (gpr, fcc));
+   emit_insn (gen_rtx_SET (operands[0],
+   lowpart_subreg (SImode, gpr, DImode)));
+  }
+else
+  emit_insn (gen_movcf2grsi (operands[0], fcc));
+
+DONE;
+  })
+
  


  ;;
  ;;  
diff --git a/gcc/config/loongarch/predicates.md b/gcc/config/loongarch/predicates.md
index 9e9ce58cb53..83fea08315c 100644
--- a/gcc/config/loongarch/predicates.md
+++ b/gcc/config/loongarch/predicates.md
@@ -590,6 +590,10 @@ (define_predicate "order_operator"
  (define_predicate "loongarch_cstore_operator"
(match_code "ne,eq,gt,gtu,ge,geu,lt,ltu,le,leu"))
  
+(define_predicate "loongarch_fcmp_operator"

+  (match_code
+"unordered,uneq,unlt,unle,eq,lt,le,ordered,ltgt,ne,ge,gt,unge,ungt"))
+
  (define_predicate "small_data_pattern"
(and (match_code "set,parallel,unspec,unspec_volatile,prefetch")
 (match_test "loongarch_small_data_pattern_p (op)")))

and now this function is compiled to (with LOGICAL_OP_NON_SHORT_CIRCUIT
= 1):

fld.s   $f1,$r4,0
fld.s   $f0,$r4,4
fld.s   $f3,$r4,8
fld.s   $f2,$r4,12
fcmp.slt.s  $fcc1,$f0,$f3
fcmp.sgt.s  $fcc0,$f1,$f2
movcf2gr  $r13,$fcc1
movcf2gr  $r12,$fcc0
or  $r12,$r12,$r13
bnez$r12,.L3
fld.s   $f4,$r4,16
fld.s   $f5,$r4,20
or  $r4,$r0,$r0
fcmp.sgt.s  $fcc1,$f1,$f5
fcmp.slt.s  $fcc0,$f0,$f4
movcf2gr  $r12,$fcc1
movcf2gr  $r13,$fcc0
or  $r12,$r12,$r13
bnez$r12,.L2
fcmp.sgt.s  $fcc1,$f3,$f5
fcmp.slt.s  $fcc0,$f2,$f4
movcf2gr  $r4,$fcc1
movcf2gr  $r12,$fcc0
or  $r4,$r4,$r12
xori$r4,$r4,1
slli.w  $r4,$r4,0
jr  $r1
.align  4
.L3:
or  $r4,$r0,$r0
.align  4
.L2:
jr  $r1

Per my micro-benchmark this is much faster than
LOGICAL_OP_NON_SHORT_CIRCUIT = 0 for randomly generated inputs (i.e.
when the branches are not predictable).

Note that there is a redundant slli.w instruction in the compiled code
and I couldn't find a way to remove it (my trick in the TARGET_64BIT
branch only works for simple examples).  We may be able to handle this
via the ext_dce pass [1] in the future.
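
The testcase itself is not quoted in this message; a function of the
same shape, three pairs of float compares each combined with ||, would
look something like:

int
all_inside (const float *p)
{
  if (p[1] < p[2] || p[0] > p[3])
    return 0;
  if (p[0] > p[5] || p[1] < p[4])
    return 0;
  return !(p[2] > p[5] || p[3] < p[4]);
}

With the movcf2gr pattern above, each || of two fcmp results can be
evaluated branchlessly by moving both FCC bits to GPRs and or-ing them,
which is what the generated assembly shows.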


[PATCH] RISC-V: Postpone full available optimization [VSETVL PASS]

2023-12-12 Thread Juzhe-Zhong
Fix a VSETVL bug where the AVL is polluted:

.L15:
li  a3,9
lui a4,%hi(s)
sw  a3,%lo(j)(t2)
sh  a5,%lo(s)(a4) <-- a4 holds the address of s
beq t0,zero,.L42
sw  t5,8(t4)
vsetvli zero,a4,e8,m8,ta,ma  <<--- a4 used as the AVL

Actually, this vsetvl is redundant.
The root cause is that we include the full-available optimization in the
LCM local data computation.

The full-available optimization should be done after the LCM computation.

PR target/112929
PR target/112988

gcc/ChangeLog:

* config/riscv/riscv-vsetvl.cc 
(pre_vsetvl::compute_lcm_local_properties): Remove full available.
(pre_vsetvl::pre_global_vsetvl_info): Add full available optimization.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/vsetvl/pr112929.c: New test.
* gcc.target/riscv/rvv/vsetvl/pr112988.c: New test.

---
 gcc/config/riscv/riscv-vsetvl.cc  | 14 +++-
 .../gcc.target/riscv/rvv/vsetvl/pr112929.c| 58 
 .../gcc.target/riscv/rvv/vsetvl/pr112988.c| 69 +++
 3 files changed, 139 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr112929.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr112988.c

diff --git a/gcc/config/riscv/riscv-vsetvl.cc b/gcc/config/riscv/riscv-vsetvl.cc
index ed5a2b58ab0..6af8d8429ab 100644
--- a/gcc/config/riscv/riscv-vsetvl.cc
+++ b/gcc/config/riscv/riscv-vsetvl.cc
@@ -2723,8 +2723,7 @@ pre_vsetvl::compute_lcm_local_properties ()
   vsetvl_info _info = block_info.get_entry_info ();
   vsetvl_info _info = block_info.get_exit_info ();
 
-  if (header_info.valid_p ()
- && (anticipated_exp_p (header_info) || block_info.full_available))
+  if (header_info.valid_p () && anticipated_exp_p (header_info))
bitmap_set_bit (m_antloc[bb_index],
get_expr_index (m_exprs, header_info));
 
@@ -3224,6 +3223,17 @@ pre_vsetvl::pre_global_vsetvl_info ()
   info.set_delete ();
 }
 
+  /* Remove vsetvl infos if all predecessors are available to the block.  */
+  for (const bb_info *bb : crtl->ssa->bbs ())
+{
+  vsetvl_block_info _info = get_block_info (bb);
+  if (block_info.empty_p () || !block_info.full_available)
+   continue;
+
+  vsetvl_info  = block_info.get_entry_info ();
+  info.set_delete ();
+}
+
   for (const bb_info *bb : crtl->ssa->bbs ())
 {
   vsetvl_block_info _info = get_block_info (bb);
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr112929.c b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr112929.c
new file mode 100644
index 000..0435e5dbc56
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr112929.c
@@ -0,0 +1,58 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gcv -mabi=lp64d -O3" } */
+
+int printf(char *, ...);
+int a, l, i, p, q, t, n, o;
+int *volatile c;
+static int j;
+static struct pack_1_struct d;
+long e;
+char m = 5;
+short s;
+
+#pragma pack(1)
+struct pack_1_struct {
+  long c;
+  int d;
+  int e;
+  int f;
+  int g;
+  int h;
+  int i;
+} h, r = {1}, *f = , *volatile g;
+
+void add_em_up(int count, ...) {
+  __builtin_va_list ap;
+  __builtin_va_start(ap, count);
+  __builtin_va_end(ap);
+}
+
+int main() {
+  int u;
+  j = 0;
+
+  for (; j < 9; ++j) {
+u = ++t ? a : 0;
+if (u) {
+  int *v = 
+  *v = g || e;
+  *c = 0;
+  *f = h;
+}
+s = l && c;
+o = i;
+d.f || (p = 0);
+q |= n;
+  }
+
+  r = *f;
+
+  add_em_up(1, 1);
+
+  printf("%d\n", m);
+}
+
+/* { dg-final { scan-assembler-times {vsetvli} 2 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-funroll-loops" no-opts "-g" } } } } */
+/* { dg-final { scan-assembler-not {vsetivli} } } */
+/* { dg-final { scan-assembler-times {vsetvli\tzero,\s*[a-x0-9]+,\s*e8,\s*m8,\s*t[au],\s*m[au]} 2 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-funroll-loops" no-opts "-g" } } } } */
+/* { dg-final { scan-assembler-times {li\t[a-x0-9]+,\s*32} 2 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-funroll-loops" no-opts "-g" } } } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr112988.c b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr112988.c
new file mode 100644
index 000..6f983ef8bb5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr112988.c
@@ -0,0 +1,69 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gcv -mabi=lp64d -O3" } */
+
+int a = 0;
+int p, q, r, x = 230;
+short d;
+int e[256];
+static struct f w;
+int *c = 
+
+short y(short z) {
+  return z * d;
+}
+
+#pragma pack(1)
+struct f {
+  int g;
+  short h;
+  int j;
+  char k;
+  char l;
+  long m;
+  long n;
+  int o;
+} s = {1}, v, t, *u = , *b = 
+
+void add_em_up(int count, ...) {
+  __builtin_va_list ap;
+  __builtin_va_start(ap, count);
+  __builtin_va_end(ap);
+}
+
+int main() {
+  int i = 0;
+  for (; i < 256; i++)
+e[i] = i;
+
+  p = 0;
+  for (; 

RE: [PATCH] [gcc-wwwdocs]gcc-13/14: Mention Intel new ISA and march support

2023-12-12 Thread Gerald Pfeifer
On Mon, 27 Nov 2023, Jiang, Haochen wrote:
>> How about changing this to use "and", as in
>>   "The switch enables the AMX-FP16, PREFETCHI ISA extensions."
>> ?
> Ok for me.

Done and pushed thusly.

Gerald


commit 617a25d7d89a9cce121e85b693eed1ee3f94354b
Author: Gerald Pfeifer 
Date:   Wed Dec 13 13:43:39 2023 +0800

gcc-13: Refine note on -march=graniterapids

diff --git a/htdocs/gcc-13/changes.html b/htdocs/gcc-13/changes.html
index 8ef3d639..ee6383a0 100644
--- a/htdocs/gcc-13/changes.html
+++ b/htdocs/gcc-13/changes.html
@@ -593,7 +593,7 @@ You may also want to check out our
   
   GCC now supports the Intel CPU named Granite Rapids through
 -march=graniterapids.
-The switch enables the AMX-FP16, PREFETCHI ISA extensions.
+The switch enables the AMX-FP16 and PREFETCHI ISA extensions.
   
   GCC now supports the Intel CPU named Granite Rapids D through
 -march=graniterapids-d.


Re: [PATCH] SRA: Force gimple operand in an additional corner case (PR 112822)

2023-12-12 Thread Peter Bergner
On 12/12/23 8:36 PM, Jason Merrill wrote:
> This test is failing for me below C++17, I think you need
> 
> // { dg-do compile { target c++17 } }
> or
> // { dg-require-effective-target c++17 }

Sorry about that.  Should we do the above or should we just add
-std=c++17 to dg-options?  ...or do we need to do both?

Peter





Re: [PATCH] c++: End lifetime of objects in constexpr after destructor call [PR71093]

2023-12-12 Thread Jason Merrill

On 12/12/23 12:50, Jason Merrill wrote:

On 12/12/23 10:24, Jason Merrill wrote:

On 12/12/23 06:15, Jakub Jelinek wrote:

On Tue, Dec 12, 2023 at 02:13:43PM +0300, Alexander Monakov wrote:



On Tue, 12 Dec 2023, Jakub Jelinek wrote:


On Mon, Dec 11, 2023 at 05:00:50PM -0500, Jason Merrill wrote:
In discussion of PR71093 it came up that more clobber_kind options
would be useful within the C++ front-end.
useful within the C++ front-end.

gcc/ChangeLog:

* tree-core.h (enum clobber_kind): Rename CLOBBER_EOL to
CLOBBER_STORAGE_END.  Add CLOBBER_STORAGE_BEGIN,
CLOBBER_OBJECT_BEGIN, CLOBBER_OBJECT_END.
* gimple-lower-bitint.cc
* gimple-ssa-warn-access.cc
* gimplify.cc
* tree-inline.cc
* tree-ssa-ccp.cc: Adjust for rename.


Doesn't build_clobber_this in the C++ front-end need to be adjusted too?

I think it is used to place clobbers at start of the ctor (should be
CLOBBER_OBJECT_BEGIN in the new nomenclature) and end of the dtor (i.e.
CLOBBER_OBJECT_END).


You're right.


I had been thinking to leave that to Nathaniel's patch, but sure, I'll 
hoist those bits out:


I've now pushed this version of the patch; Nathaniel, do you want to 
rebase on it?


Actually, I'll take care of that.

Jason
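
A rough sketch of where the four kinds mark lifetimes after the rename
(illustrative comments, not GCC source or exact dump syntax):

struct S { int i; S () : i (0) {} ~S () {} };

void
f ()
{
  // storage becomes available:        CLOBBER_STORAGE_BEGIN on 's'
  S s;  // the constructor starts the object: CLOBBER_OBJECT_BEGIN
  // ...
  // the destructor ends the object:   CLOBBER_OBJECT_END
  // end of scope ends the storage:    CLOBBER_STORAGE_END (was CLOBBER_EOL)
}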



Re: [PATCH] RISC-V: Add Zvfbfmin extension to the -march= option

2023-12-12 Thread Palmer Dabbelt

On Tue, 12 Dec 2023 19:24:51 PST (-0800), zengx...@eswincomputing.com wrote:

This patch adds a new sub-extension (aka Zvfbfmin) to the
-march= option.  It introduces a new data type, BF16.

Depending on the usage scenario, the Zvfbfmin extension may
depend on 'V' or 'Zve32f'.  This patch only implements the dependency
for the Embedded Processor scenario.  For the Application Processor
scenario, it is necessary to explicitly specify the dependent
'V' extension.

You can find more information about Zvfbfmin in the spec doc below.

https://github.com/riscv/riscv-bfloat16/releases/download/20231027/riscv-bfloat16.pdf

gcc/ChangeLog:

* common/config/riscv/riscv-common.cc:
(riscv_implied_info): Add zvfbfmin item.
(riscv_ext_version_table): Ditto.
(riscv_ext_flag_table): Ditto.
* config/riscv/riscv.opt:
(MASK_ZVFBFMIN): New macro.
(MASK_VECTOR_ELEN_BF_16): Ditto.
(TARGET_ZVFBFMIN): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/arch-31.c: New test.
* gcc.target/riscv/arch-32.c: New test.
* gcc.target/riscv/predef-32.c: New test.
* gcc.target/riscv/predef-33.c: New test.
---
 gcc/common/config/riscv/riscv-common.cc|  4 ++
 gcc/config/riscv/riscv.opt |  4 ++
 gcc/testsuite/gcc.target/riscv/arch-31.c   |  5 +++
 gcc/testsuite/gcc.target/riscv/arch-32.c   |  5 +++
 gcc/testsuite/gcc.target/riscv/predef-32.c | 43 ++
 gcc/testsuite/gcc.target/riscv/predef-33.c | 43 ++
 6 files changed, 104 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/riscv/arch-31.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/arch-32.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/predef-32.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/predef-33.c

diff --git a/gcc/common/config/riscv/riscv-common.cc b/gcc/common/config/riscv/riscv-common.cc
index 4d5a2f874a2..370d00b8f7a 100644
--- a/gcc/common/config/riscv/riscv-common.cc
+++ b/gcc/common/config/riscv/riscv-common.cc
@@ -151,6 +151,7 @@ static const riscv_implied_info_t riscv_implied_info[] =

   {"zfa", "f"},

+  {"zvfbfmin", "zve32f"},
   {"zvfhmin", "zve32f"},
   {"zvfh", "zve32f"},
   {"zvfh", "zfhmin"},
@@ -313,6 +314,7 @@ static const struct riscv_ext_version riscv_ext_version_table[] =

   {"zfh",   ISA_SPEC_CLASS_NONE, 1, 0},
   {"zfhmin",ISA_SPEC_CLASS_NONE, 1, 0},
+  {"zvfbfmin",  ISA_SPEC_CLASS_NONE, 1, 0},
   {"zvfhmin",   ISA_SPEC_CLASS_NONE, 1, 0},
   {"zvfh",  ISA_SPEC_CLASS_NONE, 1, 0},

@@ -1657,6 +1659,7 @@ static const riscv_ext_flag_table_t riscv_ext_flag_table[] =
   {"zve64x",   &gcc_options::x_riscv_vector_elen_flags, MASK_VECTOR_ELEN_64},
   {"zve64f",   &gcc_options::x_riscv_vector_elen_flags, MASK_VECTOR_ELEN_FP_32},
   {"zve64d",   &gcc_options::x_riscv_vector_elen_flags, MASK_VECTOR_ELEN_FP_64},
+  {"zvfbfmin", &gcc_options::x_riscv_vector_elen_flags, MASK_VECTOR_ELEN_BF_16},
   {"zvfhmin",  &gcc_options::x_riscv_vector_elen_flags, MASK_VECTOR_ELEN_FP_16},
   {"zvfh",     &gcc_options::x_riscv_vector_elen_flags, MASK_VECTOR_ELEN_FP_16},

@@ -1692,6 +1695,7 @@ static const riscv_ext_flag_table_t riscv_ext_flag_table[] =

   {"zfhmin",    &gcc_options::x_riscv_zf_subext, MASK_ZFHMIN},
   {"zfh",       &gcc_options::x_riscv_zf_subext, MASK_ZFH},
+  {"zvfbfmin",  &gcc_options::x_riscv_zf_subext, MASK_ZVFBFMIN},
   {"zvfhmin",   &gcc_options::x_riscv_zf_subext, MASK_ZVFHMIN},
   {"zvfh",      &gcc_options::x_riscv_zf_subext, MASK_ZVFH},

diff --git a/gcc/config/riscv/riscv.opt b/gcc/config/riscv/riscv.opt
index 59ce7106ecf..b7c0b72265e 100644
--- a/gcc/config/riscv/riscv.opt
+++ b/gcc/config/riscv/riscv.opt
@@ -285,6 +285,8 @@ Mask(VECTOR_ELEN_FP_64) Var(riscv_vector_elen_flags)

 Mask(VECTOR_ELEN_FP_16) Var(riscv_vector_elen_flags)

+Mask(VECTOR_ELEN_BF_16) Var(riscv_vector_elen_flags)
+
 TargetVariable
 int riscv_zvl_flags

@@ -366,6 +368,8 @@ Mask(ZFHMIN)  Var(riscv_zf_subext)

 Mask(ZFH) Var(riscv_zf_subext)

+Mask(ZVFBFMIN) Var(riscv_zf_subext)
+
 Mask(ZVFHMIN) Var(riscv_zf_subext)

 Mask(ZVFH)Var(riscv_zf_subext)
diff --git a/gcc/testsuite/gcc.target/riscv/arch-31.c b/gcc/testsuite/gcc.target/riscv/arch-31.c
new file mode 100644
index 000..5180753b905
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/arch-31.c
@@ -0,0 +1,5 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv32i_zvfbfmin -mabi=ilp32f" } */
+int foo()
+{
+}
diff --git a/gcc/testsuite/gcc.target/riscv/arch-32.c b/gcc/testsuite/gcc.target/riscv/arch-32.c
new file mode 100644
index 000..49616832512
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/arch-32.c
@@ -0,0 +1,5 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64iv_zvfbfmin -mabi=lp64d" } */
+int foo()
+{
+}
diff --git a/gcc/testsuite/gcc.target/riscv/predef-32.c b/gcc/testsuite/gcc.target/riscv/predef-32.c
new file mode 100644
index 000..7417e0d996f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/predef-32.c
@@ -0,0 +1,43 

[PATCH] RISC-V: Don't make Ztso imply A

2023-12-12 Thread Palmer Dabbelt
I can't actually find anything in the ISA manual that makes Ztso imply
A.  In theory the memory ordering is just a different thing than the set
of available instructions (i.e., Ztso without A would still imply TSO for
loads and stores).  It also seems like a configuration that could be
sane to build: without A it's all but impossible to write any meaningful
multi-core code, and TSO is really cheap for a single core.

That said, I think it's kind of reasonable to provide A to users asking
for Ztso.  So maybe even if this was a mistake it's the right thing to
do?
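
As a concrete illustration of the "without A" point (a sketch, not part
of the patch):

#include <stdatomic.h>

/* With -march=rv64i_ztso after this change there is no amoadd.w, so
   this is expected to become a libatomic call (__atomic_fetch_add_4);
   spelling the A extension explicitly, -march=rv64ia_ztso, keeps the
   inline AMO.  */
int
bump (atomic_int *p)
{
  return atomic_fetch_add (p, 1);
}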

gcc/ChangeLog:

* common/config/riscv/riscv-common.cc (riscv_implied_info):
Remove {"ztso", "a"}.
---
 gcc/common/config/riscv/riscv-common.cc | 2 --
 1 file changed, 2 deletions(-)

diff --git a/gcc/common/config/riscv/riscv-common.cc b/gcc/common/config/riscv/riscv-common.cc
index f142212f2ed..5f39e5ea462 100644
--- a/gcc/common/config/riscv/riscv-common.cc
+++ b/gcc/common/config/riscv/riscv-common.cc
@@ -71,8 +71,6 @@ static const riscv_implied_info_t riscv_implied_info[] =
   {"zks", "zksed"},
   {"zks", "zksh"},
 
-  {"ztso", "a"},
-
   {"v", "zvl128b"},
   {"v", "zve64d"},
 
-- 
2.42.1



Re: [PATCH DejaGNU 1/1] Support per-test execution timeout factor

2023-12-12 Thread Jacob Bachmeyer

Maciej W. Rozycki wrote:
Add support for the `test_timeout_factor' global variable letting a test 
case scale the wait timeout used for code execution.  This is useful for 
particularly slow test cases for which increasing the wait timeout 
globally would be excessive.


* baseboards/qemu.exp (qemu_load): Handle `test_timeout_factor'.
* config/gdb-comm.exp (gdb_comm_load): Likewise.
* config/gdb_stub.exp (gdb_stub_load): Likewise.
* config/sim.exp (sim_load): Likewise.
* config/unix.exp (unix_load): Likewise.
	* doc/dejagnu.texi (Local configuration file): Document 
	`test_timeout_factor'.

[...snip full diff...]


First, a minor technical issue:  brace your expr(n) expressions like this:

   set wait_timeout [expr { $wait_timeout * $test_timeout_factor }]

The Tcl expr(n) manpage recommends that style; it explains a few
situations where bracing is actually required for non-surprising
results, and notes that Tcl's optimizations work better if the
expression passed to expr is braced.  All expr calls in new code in
DejaGnu should have the braces.


Second, I need some more explanation how this fits together because I 
have some concerns about confusion between various timeouts.  In your 
introduction to this patch pair, you note that the test execution 
timeout and tool execution timeout are different.  My main concern is 
that "test_timeout_factor" (and for that matter, "test_timeout") may be 
badly named, or we need a more coherent model of testing with DejaGnu.  
(More precisely, we need better documentation...)


The anticipated confusion stems from the question of what exactly is the 
interval of a test?  In other words, what is the interval limited by 
"test_timeout"?  When does the clock start ticking and when does it stop 
before the alarm goes off?  (I have some suspicions that those answers 
are annoyingly counter-intuitive, which means I will have to write more 
documentation...)


Lastly, I note no objection to the dg-test-timeout-factor extension; as 
far as I can tell, dg.exp is designed to be extended in that way, so 
this is a supported extension point instead of an unsupportable monkeypatch.



-- Jacob



[PATCH] RISC-V: Add Zvfbfmin extension to the -march= option

2023-12-12 Thread Xiao Zeng
This patch adds a new sub-extension (aka Zvfbfmin) to the
-march= option.  It introduces a new data type, BF16.

Depending on the usage scenario, the Zvfbfmin extension may
depend on 'V' or 'Zve32f'.  This patch only implements the dependency
for the Embedded Processor scenario.  For the Application Processor
scenario, it is necessary to explicitly specify the dependent
'V' extension.

You can find more information about Zvfbfmin in the spec doc below.

https://github.com/riscv/riscv-bfloat16/releases/download/20231027/riscv-bfloat16.pdf

gcc/ChangeLog:

* common/config/riscv/riscv-common.cc:
(riscv_implied_info): Add zvfbfmin item.
(riscv_ext_version_table): Ditto.
(riscv_ext_flag_table): Ditto.
* config/riscv/riscv.opt:
(MASK_ZVFBFMIN): New macro.
(MASK_VECTOR_ELEN_BF_16): Ditto.
(TARGET_ZVFBFMIN): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/arch-31.c: New test.
* gcc.target/riscv/arch-32.c: New test.
* gcc.target/riscv/predef-32.c: New test.
* gcc.target/riscv/predef-33.c: New test.
---
 gcc/common/config/riscv/riscv-common.cc|  4 ++
 gcc/config/riscv/riscv.opt |  4 ++
 gcc/testsuite/gcc.target/riscv/arch-31.c   |  5 +++
 gcc/testsuite/gcc.target/riscv/arch-32.c   |  5 +++
 gcc/testsuite/gcc.target/riscv/predef-32.c | 43 ++
 gcc/testsuite/gcc.target/riscv/predef-33.c | 43 ++
 6 files changed, 104 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/riscv/arch-31.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/arch-32.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/predef-32.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/predef-33.c

diff --git a/gcc/common/config/riscv/riscv-common.cc b/gcc/common/config/riscv/riscv-common.cc
index 4d5a2f874a2..370d00b8f7a 100644
--- a/gcc/common/config/riscv/riscv-common.cc
+++ b/gcc/common/config/riscv/riscv-common.cc
@@ -151,6 +151,7 @@ static const riscv_implied_info_t riscv_implied_info[] =
 
   {"zfa", "f"},
 
+  {"zvfbfmin", "zve32f"},
   {"zvfhmin", "zve32f"},
   {"zvfh", "zve32f"},
   {"zvfh", "zfhmin"},
@@ -313,6 +314,7 @@ static const struct riscv_ext_version riscv_ext_version_table[] =
 
   {"zfh",   ISA_SPEC_CLASS_NONE, 1, 0},
   {"zfhmin",ISA_SPEC_CLASS_NONE, 1, 0},
+  {"zvfbfmin",  ISA_SPEC_CLASS_NONE, 1, 0},
   {"zvfhmin",   ISA_SPEC_CLASS_NONE, 1, 0},
   {"zvfh",  ISA_SPEC_CLASS_NONE, 1, 0},
 
@@ -1657,6 +1659,7 @@ static const riscv_ext_flag_table_t riscv_ext_flag_table[] =
   {"zve64x",   &gcc_options::x_riscv_vector_elen_flags, MASK_VECTOR_ELEN_64},
   {"zve64f",   &gcc_options::x_riscv_vector_elen_flags, MASK_VECTOR_ELEN_FP_32},
   {"zve64d",   &gcc_options::x_riscv_vector_elen_flags, MASK_VECTOR_ELEN_FP_64},
+  {"zvfbfmin", &gcc_options::x_riscv_vector_elen_flags, MASK_VECTOR_ELEN_BF_16},
   {"zvfhmin",  &gcc_options::x_riscv_vector_elen_flags, MASK_VECTOR_ELEN_FP_16},
   {"zvfh",     &gcc_options::x_riscv_vector_elen_flags, MASK_VECTOR_ELEN_FP_16},

@@ -1692,6 +1695,7 @@ static const riscv_ext_flag_table_t riscv_ext_flag_table[] =

   {"zfhmin",    &gcc_options::x_riscv_zf_subext, MASK_ZFHMIN},
   {"zfh",       &gcc_options::x_riscv_zf_subext, MASK_ZFH},
+  {"zvfbfmin",  &gcc_options::x_riscv_zf_subext, MASK_ZVFBFMIN},
   {"zvfhmin",   &gcc_options::x_riscv_zf_subext, MASK_ZVFHMIN},
   {"zvfh",      &gcc_options::x_riscv_zf_subext, MASK_ZVFH},
 
diff --git a/gcc/config/riscv/riscv.opt b/gcc/config/riscv/riscv.opt
index 59ce7106ecf..b7c0b72265e 100644
--- a/gcc/config/riscv/riscv.opt
+++ b/gcc/config/riscv/riscv.opt
@@ -285,6 +285,8 @@ Mask(VECTOR_ELEN_FP_64) Var(riscv_vector_elen_flags)
 
 Mask(VECTOR_ELEN_FP_16) Var(riscv_vector_elen_flags)
 
+Mask(VECTOR_ELEN_BF_16) Var(riscv_vector_elen_flags)
+
 TargetVariable
 int riscv_zvl_flags
 
@@ -366,6 +368,8 @@ Mask(ZFHMIN)  Var(riscv_zf_subext)
 
 Mask(ZFH) Var(riscv_zf_subext)
 
+Mask(ZVFBFMIN) Var(riscv_zf_subext)
+
 Mask(ZVFHMIN) Var(riscv_zf_subext)
 
 Mask(ZVFH)Var(riscv_zf_subext)
diff --git a/gcc/testsuite/gcc.target/riscv/arch-31.c b/gcc/testsuite/gcc.target/riscv/arch-31.c
new file mode 100644
index 000..5180753b905
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/arch-31.c
@@ -0,0 +1,5 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv32i_zvfbfmin -mabi=ilp32f" } */
+int foo()
+{
+}
diff --git a/gcc/testsuite/gcc.target/riscv/arch-32.c b/gcc/testsuite/gcc.target/riscv/arch-32.c
new file mode 100644
index 000..49616832512
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/arch-32.c
@@ -0,0 +1,5 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64iv_zvfbfmin -mabi=lp64d" } */
+int foo()
+{
+}
diff --git a/gcc/testsuite/gcc.target/riscv/predef-32.c b/gcc/testsuite/gcc.target/riscv/predef-32.c
new file mode 100644
index 000..7417e0d996f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/predef-32.c
@@ -0,0 +1,43 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 

[PATCH #2a/2] strub: indirect volatile parms in wrappers

2023-12-12 Thread Alexandre Oliva
[sorry that the previous, unfinished post got through]

On Dec 12, 2023, Richard Biener  wrote:

> On Tue, Dec 12, 2023 at 3:03 AM Alexandre Oliva  wrote:

>> DECL_NOT_GIMPLE_REG_P (arg) = 0;

> I wonder why you clear this at all?

That code seems to be inherited from expand_thunk.
ISTR that flag was not negated when I started the strub implementation,
back in gcc-10.

>> +convert in separate statements.  ???  Should
>> +we drop volatile from the wrapper
>> +instead?  */

> volatile on function parameters are indeed odd beasts.  You could
> also force volatile arguments to be passed indirectly.

Ooh, I like that, thanks!  Regstrapped on x86_64-linux-gnu, on top of
#1/2, now a cleanup that IMHO would still be desirable.


Arrange for strub internal wrappers to pass volatile arguments by
reference to the wrapped bodies.


for  gcc/ChangeLog

PR middle-end/112938
* ipa-strub.cc (pass_ipa_strub::execute): Pass volatile args
by reference to internal strub wrapped bodies.

for  gcc/testsuite/ChangeLog

PR middle-end/112938
* gcc.dg/strub-internal-volatile.c: Check indirection of
volatile args.
---
 gcc/ipa-strub.cc   |   19 +--
 gcc/testsuite/gcc.dg/strub-internal-volatile.c |5 +
 2 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/gcc/ipa-strub.cc b/gcc/ipa-strub.cc
index 45294b0b46bcb..943bb60996fc1 100644
--- a/gcc/ipa-strub.cc
+++ b/gcc/ipa-strub.cc
@@ -2881,13 +2881,14 @@ pass_ipa_strub::execute (function *)
   parm = DECL_CHAIN (parm),
   nparm = DECL_CHAIN (nparm),
   nparmt = nparmt ? TREE_CHAIN (nparmt) : NULL_TREE)
-  if (!(0 /* DECL_BY_REFERENCE (narg) */
-   || is_gimple_reg_type (TREE_TYPE (nparm))
-   || VECTOR_TYPE_P (TREE_TYPE (nparm))
-   || TREE_CODE (TREE_TYPE (nparm)) == COMPLEX_TYPE
-   || (tree_fits_uhwi_p (TYPE_SIZE_UNIT (TREE_TYPE (nparm)))
-   && (tree_to_uhwi (TYPE_SIZE_UNIT (TREE_TYPE (nparm)))
-   <= 4 * UNITS_PER_WORD
+  if (TREE_THIS_VOLATILE (parm)
+ || !(0 /* DECL_BY_REFERENCE (narg) */
+  || is_gimple_reg_type (TREE_TYPE (nparm))
+  || VECTOR_TYPE_P (TREE_TYPE (nparm))
+  || TREE_CODE (TREE_TYPE (nparm)) == COMPLEX_TYPE
+  || (tree_fits_uhwi_p (TYPE_SIZE_UNIT (TREE_TYPE (nparm)))
+  && (tree_to_uhwi (TYPE_SIZE_UNIT (TREE_TYPE (nparm)))
+  <= 4 * UNITS_PER_WORD
{
  /* No point in indirecting pointer types.  Presumably they
 won't ever pass the size-based test above, but check the
@@ -3224,9 +3225,7 @@ pass_ipa_strub::execute (function *)
{
  tree tmp = arg;
  /* If ARG is e.g. volatile, we must copy and
-convert in separate statements.  ???  Should
-we drop volatile from the wrapper
-instead?  */
+convert in separate statements.  */
  if (!is_gimple_val (arg))
{
  tmp = create_tmp_reg (TYPE_MAIN_VARIANT
diff --git a/gcc/testsuite/gcc.dg/strub-internal-volatile.c b/gcc/testsuite/gcc.dg/strub-internal-volatile.c
index cdfca67616bc8..227406af245cc 100644
--- a/gcc/testsuite/gcc.dg/strub-internal-volatile.c
+++ b/gcc/testsuite/gcc.dg/strub-internal-volatile.c
@@ -1,4 +1,5 @@
 /* { dg-do compile } */
+/* { dg-options "-fdump-ipa-strub" } */
 /* { dg-require-effective-target strub } */
 
 void __attribute__ ((strub("internal")))
@@ -8,3 +9,7 @@ f(volatile short) {
 void g(void) {
   f(0);
 }
+
+/* We make volatile parms indirect in the wrapped f.  */
+/* { dg-final { scan-ipa-dump-times "volatile short" 2 "strub" } } */
+/* { dg-final { scan-ipa-dump-times "volatile short int &" 1 "strub" } } */


-- 
Alexandre Oliva, happy hacker   https://FSFLA.org/blogs/lxo/
   Free Software Activist   GNU Toolchain Engineer
More tolerance and less prejudice are key for inclusion and diversity
Excluding neuro-others for not behaving ""normal"" is *not* inclusive




Re: PING^1 [PATCH] range: Workaround different type precision issue between _Float128 and long double [PR112788]

2023-12-12 Thread Kewen.Lin
Hi Jakub & Andrew,

on 2023/12/12 22:42, Jakub Jelinek wrote:
> On Tue, Dec 12, 2023 at 09:33:38AM -0500, Andrew MacLeod wrote:
>> I leave this for the release managers, but I am not opposed to it for this
>> release... It would be nice to remove it for the next release
> 
> I can live with it for GCC 14, so ok, but it is very ugly.

Thanks, pushed as r14-6478-gfda8e2f8292a90.

And yes, I strongly agree that we should get rid of this in next release.

> 
> We should fix it in a better way for GCC 15+.
> I think we shouldn't lie, both on the mode precisions and on type
> precisions.  The middle-end already contains some hacks to make it
> work to some extent on 2 different modes with same precision (for BFmode vs.
> HFmode), on the FE side if we need a target hook the C/C++ FE will use
> to choose type ranks and/or the type for binary operations, so be it.
> It would be also great if rs6000 backend had just 2 modes for 128-bit
> floats, one for IBM double double, one for IEEE quad, not 3 as it has now,
> perhaps with TFmode being a macro that conditionally expands to one or the
> other.  Or do some tweaks in target hooks to keep backwards compatibility
> with mode attribute and similar.

Thanks for all the insightful suggestions, I just filed PR112993 for
further tracking and self-assigned it.

BR,
Kewen


Re: [PATCH v2] LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT.

2023-12-12 Thread chenglulu



On 2023/12/13 02:27, Xi Ruoyao wrote:

On Tue, 2023-12-12 at 20:39 +0800, Xi Ruoyao wrote:

fld.s   $f1,$r4,0
fld.s   $f0,$r4,4
fld.s   $f3,$r4,8
fld.s   $f2,$r4,12
fcmp.slt.s  $fcc1,$f0,$f3
fcmp.sgt.s  $fcc0,$f1,$f2
movcf2gr  $r13,$fcc1
movcf2gr  $r12,$fcc0


There is also a problem that on the 3A5000 MOVCF2GR requires 7 cycles,
while MOVCF2FR+MOVFR2GR is one cycle.  The 3A6000 has no such problem.


or  $r12,$r12,$r13
bnez$r12,.L3
fld.s   $f4,$r4,16
fld.s   $f5,$r4,20
or  $r4,$r0,$r0
fcmp.sgt.s  $fcc1,$f1,$f5
fcmp.slt.s  $fcc0,$f0,$f4
movcf2gr  $r12,$fcc1
movcf2gr  $r13,$fcc0
or  $r12,$r12,$r13
bnez$r12,.L2
fcmp.sgt.s  $fcc1,$f3,$f5
fcmp.slt.s  $fcc0,$f2,$f4
movcf2gr  $r4,$fcc1
movcf2gr  $r12,$fcc0
or  $r4,$r4,$r12
xori$r4,$r4,1
slli.w  $r4,$r4,0
jr  $r1
.align  4
.L3:
or  $r4,$r0,$r0
.align  4
.L2:
jr  $r1

Per my micro-benchmark this is much faster than
LOGICAL_OP_NON_SHORT_CIRCUIT = 0 for randomly generated inputs (i.e.
when the branches are not predictable).

Note that there is a redundant slli.w instruction in the compiled code
and I couldn't find a way to remove it (my trick in the TARGET_64BIT
branch only works for simple examples).  We may be able to handle this
via the ext_dce pass [1] in the future.

[1]: https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637320.html





Re: [PATCH] SRA: Force gimple operand in an additional corner case (PR 112822)

2023-12-12 Thread Jason Merrill

On 12/12/23 17:50, Peter Bergner wrote:

On 12/12/23 1:26 PM, Richard Biener wrote:

Am 12.12.2023 um 19:51 schrieb Peter Bergner :

On 12/12/23 12:45 PM, Peter Bergner wrote:

+/* PR target/112822 */


Oops, this should be:

/* PR tree-optimization/112822 */

It's fixed on my end.


Ok


Pushed now that Martin has pushed his fix.  Thanks!


This test is failing for me below C++17, I think you need

// { dg-do compile { target c++17 } }
or
// { dg-require-effective-target c++17 }

Jason



[PATCH] i386: Fix PR110790 testcase

2023-12-12 Thread Haochen Jiang
Hi all,

This patch will fix the testcase fail previously introduced.

Approved by another thread:

https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640288.html

Pushed to trunk.

Thx,
Haochen

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr110790-2.c: Change scan-assembler from shrq
to shr\[qx\].
---
 gcc/testsuite/gcc.target/i386/pr110790-2.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.target/i386/pr110790-2.c b/gcc/testsuite/gcc.target/i386/pr110790-2.c
index 16c73cb7465..dbb526308e6 100644
--- a/gcc/testsuite/gcc.target/i386/pr110790-2.c
+++ b/gcc/testsuite/gcc.target/i386/pr110790-2.c
@@ -21,5 +21,5 @@ refmpn_tstbit_bad (mp_srcptr ptr, unsigned long bit)
 shrq    %cl, %rax
 andl   $1, %eax
  */
-/* { dg-final { scan-assembler-times "shrq" 2 { target { lp64 } } } } */
+/* { dg-final { scan-assembler-times "shr\[qx\]" 2 { target { lp64 } } } } */
 /* { dg-final { scan-assembler-times "andl" 2 { target { lp64 } } } } */
-- 
2.31.1



RE: [RFC] Intel AVX10.1 Compiler Design and Support

2023-12-12 Thread Jiang, Haochen
> > On the other hand, a new EVEX-capable level might bring earlier adoption
> > of EVEX capabilities to AMD CPUs, which still should be an improvement
> > over AVX2.  This could benefit AMD as well.  So I would really like to
> > see some AMD feedback here.
> >
> > There's also the matter that time scales for EVEX adoption are so long
> > that by then, Intel CPUs may end up supporting and preferring 512 bit
> > vectors again.
> 
> True, there isn't even widespread VEX adoption yet ... and now there's
> APX as the next best thing to target.
> 
> That said, my main point was that x86-64-v4 is "broken" as it appears
> as a dead end - AVX512 is no more, the future is AVX10, but yet we have
> to define x86-64-v5 as something that includes x86-64-v4.
> 
> So, can we un-do x86-64-v4?

As far as I have heard, x86-64-v4 is rarely used.  There might be a
chance to un-do that without breaking too many things.  But I am not sure.

Thx,
Haochen

> 
> Richard.
> 
> > Thanks,
> > Florian
> >


[PATCH v2] LoongArch: Modify the check type of the vector builtin function.

2023-12-12 Thread chenxiaolong
On the LoongArch architecture, regression testing with the latest GCC 14
shows that the test cases in the vector directory produce FAIL entries
due to unmatched pointer types.  To solve this kind of problem, the types
of the variables used to check the results are changed to match the
parameter types defined by the vector builtin functions.
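
A minimal sketch of the kind of mismatch being fixed (the types and
names here are illustrative, not copied from the testsuite):

typedef long long v2i64 __attribute__ ((vector_size (16)));

v2i64 res, ref;

int
check_v2i64 (void)
{
  for (unsigned i = 0; i < 2; i++)
    {
      /* Taking the element's address as a mismatched pointer type is
	 what triggered the FAILs; casting to the builtin's parameter
	 type (long long here) fixes it.  */
      long long *pr = (long long *) &res + i;
      long long *pe = (long long *) &ref + i;
      if (*pr != *pe)
	return 1;
    }
  return 0;
}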

gcc/testsuite/ChangeLog:

* gcc.target/loongarch/vector/simd_correctness_check.h: The variable
types in the check results are modified in conjunction with the
parameter types defined in the vector builtin function.
---
v1->v2:
If an error occurs, output the data in hexadecimal format, and fill the
high part of the result with 0.
---
 .../loongarch/vector/simd_correctness_check.h   | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/gcc/testsuite/gcc.target/loongarch/vector/simd_correctness_check.h b/gcc/testsuite/gcc.target/loongarch/vector/simd_correctness_check.h
index eb7fbd59cc7..551340bd51f 100644
--- a/gcc/testsuite/gcc.target/loongarch/vector/simd_correctness_check.h
+++ b/gcc/testsuite/gcc.target/loongarch/vector/simd_correctness_check.h
@@ -8,11 +8,12 @@
   int fail = 0;   \
   for (size_t i = 0; i < sizeof (res) / sizeof (res[0]); ++i) \
 { \
-  long *temp_ref = [i], *temp_res = [i];  \
+  long long *temp_ref = (long long *)[i], \
+   *temp_res = (long long *)[i]; \
   if (abs (*temp_ref - *temp_res) > 0)\
 { \
   printf (" error: %s at line %ld , expected " #ref   \
-  "[%ld]:0x%lx, got: 0x%lx\n",\
+  "[%ld]:0x%016lx, got: 0x%016lx\n",  \
   __FILE__, line, i, *temp_ref, *temp_res);   \
   fail = 1;   \
 } \
@@ -28,11 +29,11 @@
   int fail = 0;   \
   for (size_t i = 0; i < sizeof (res) / sizeof (res[0]); ++i) \
 { \
-  int *temp_ref = [i], *temp_res = [i];   \
+  int *temp_ref = (int *)[i], *temp_res = (int *)[i]; \
   if (abs (*temp_ref - *temp_res) > 0)\
 { \
   printf (" error: %s at line %ld , expected " #ref   \
-  "[%ld]:0x%x, got: 0x%x\n",  \
+  "[%ld]:0x%08x, got: 0x%08x\n",  \
   __FILE__, line, i, *temp_ref, *temp_res);   \
   fail = 1;   \
 } \
@@ -47,8 +48,8 @@
 { \
   if (ref != res) \
 { \
-  printf (" error: %s at line %ld , expected %d, got %d\n", __FILE__, \
-  line, ref, res);\
+  printf (" error: %s at line %ld , expected 0x:%016x",   \
+ "got 0x:%016x\n", __FILE__, line, ref, res);\
 } \
 } \
   while (0)
-- 
2.20.1



Re: [PATCH v2] LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT.

2023-12-12 Thread chenglulu


On 2023/12/13 02:27, Xi Ruoyao wrote:


fld.s   $f1,$r4,0
fld.s   $f0,$r4,4
fld.s   $f3,$r4,8
fld.s   $f2,$r4,12
fcmp.slt.s  $fcc1,$f0,$f3
fcmp.sgt.s  $fcc0,$f1,$f2
movcf2gr  $r13,$fcc1
movcf2gr  $r12,$fcc0
or  $r12,$r12,$r13
bnez$r12,.L3
fld.s   $f4,$r4,16
fld.s   $f5,$r4,20
or  $r4,$r0,$r0
fcmp.sgt.s  $fcc1,$f1,$f5
fcmp.slt.s  $fcc0,$f0,$f4
movcf2gr  $r12,$fcc1
movcf2gr  $r13,$fcc0
or  $r12,$r12,$r13
bnez$r12,.L2
fcmp.sgt.s  $fcc1,$f3,$f5
fcmp.slt.s  $fcc0,$f2,$f4
movcf2gr$r4,$fcc1
movcf2gr$r12,$fcc0
or  $r4,$r4,$r12
xori$r4,$r4,1
slli.w  $r4,$r4,0
jr  $r1
.align  4
.L3:
or  $r4,$r0,$r0
.align  4
.L2:
jr  $r1

Per my micro-benchmark this is much faster than
LOGICAL_OP_NON_SHORT_CIRCUIT = 0 for randomly generated inputs (i.e.
when the branches are not predictable).

Note that there is a redundant slli.w instruction in the compiled code
and I couldn't find a way to remove it (my trick in the TARGET_64BIT
branch only works for simple examples).  We may be able to handle via
the ext_dce pass [1] in the future.
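
For context, a minimal sketch (not from the patch) of the kind of source this
tuning affects:

    /* With LOGICAL_OP_NON_SHORT_CIRCUIT enabled, GCC may evaluate both
       comparisons and OR the results instead of branching after the first
       one, which wins when the branch is hard to predict:  */
    int f (float a, float b, float c, float d)
    {
      return a < b || c < d;   /* compiled roughly as (a < b) | (c < d) */
    }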


The patches in the attachments can remove the remaining sign-extension
directives from the assembly.


[1]:https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637320.html

>From 01eea237e13056fad9839219ed1aa70037cd3b60 Mon Sep 17 00:00:00 2001
From: Lulu Cheng 
Date: Fri, 8 Dec 2023 10:16:48 +0800
Subject: [PATCH v1] LoongArch: Optimized some of the sign-extension
 instructions generated during bitwise operations

There are two mode iterators defined in the loongarch.md:
	(define_mode_iterator GPR [SI (DI "TARGET_64BIT")])
  and
	(define_mode_iterator X [(SI "!TARGET_64BIT") (DI "TARGET_64BIT")])
Replace the mode in the bit arithmetic from GPR to X.

Since the bitwise operation instructions do not distinguish between 64-bit,
32-bit, etc., a sign extension is needed when the result of a bitwise
operation is narrower than 64 bits.
The original definition generated many redundant sign-extension
instructions.  This problem is fixed by following the RISC-V
implementation.
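
As a small illustration (not from the patch) of the redundancy being removed:

    /* On 64-bit LoongArch, 32-bit values are kept sign-extended in the
       64-bit registers, and bitwise operations preserve that property,
       so the extra sign-extending slli.w that the old patterns forced
       after e.g. a nor is redundant:  */
    int f (int a, int b)
    {
      return ~(a | b);   /* a single nor suffices for the SImode result */
    }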

gcc/ChangeLog:

	* config/loongarch/loongarch.md (one_cmpl<mode>2): Replace GPR with X.
	(*nor<mode>3): Likewise.
	(nor<mode>3): Likewise.
	(*branch_on_bit<X:mode>): Likewise.
	(*branch_on_bit_range<X:mode>): Likewise.
	(*negsi2_extended): New template.
	(*<optab>si3_internal): Likewise.
	(*one_cmplsi2_internal): Likewise.
	(*norsi3_internal): Likewise.
	(*n<optab>si_internal): Likewise.
	(bytepick_w_<bytepick_imm>_extend): Modify this template according to the
	modified bit operation to make the optimization work.
	* config/loongarch/predicates.md (branch_on_bit_operand): New predicate.

gcc/testsuite/ChangeLog:

	* gcc.target/loongarch/sign-extend-1.c: New test.
---
 gcc/config/loongarch/loongarch.md | 148 +++---
 gcc/config/loongarch/predicates.md|   5 +
 .../gcc.target/loongarch/sign-extend-1.c  |  21 +++
 3 files changed, 151 insertions(+), 23 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/loongarch/sign-extend-1.c

diff --git a/gcc/config/loongarch/loongarch.md b/gcc/config/loongarch/loongarch.md
index 7a101dd64b7..35788deafc7 100644
--- a/gcc/config/loongarch/loongarch.md
+++ b/gcc/config/loongarch/loongarch.md
@@ -721,7 +721,7 @@ (define_insn "sub<mode>3"
 
 (define_insn "sub<mode>3"
   [(set (match_operand:GPR 0 "register_operand" "=r")
-	(minus:GPR (match_operand:GPR 1 "register_operand" "rJ")
+	(minus:GPR (match_operand:GPR 1 "register_operand" "r")
 		   (match_operand:GPR 2 "register_operand" "r")))]
   ""
   "sub.\t%0,%z1,%2"
@@ -1327,13 +1327,13 @@ (define_insn "neg<mode>2"
   [(set_attr "alu_type"	"sub")
   (set_attr "mode" "<MODE>")])
 
-(define_insn "one_cmpl2"
-  [(set (match_operand:GPR 0 "register_operand" "=r")
-	(not:GPR (match_operand:GPR 1 "register_operand" "r")))]
-  ""
-  "nor\t%0,%.,%1"
-  [(set_attr "alu_type" "not")
-   (set_attr "mode" "")])
+(define_insn "*negsi2_extended"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+	(sign_extend:DI (neg:SI (match_operand:SI 1 "register_operand" "r"))))]
+  "TARGET_64BIT"
+  "sub.w\t%0,%.,%1"
+  [(set_attr "alu_type"	"sub")
+   (set_attr "mode" "SI")])
 
 (define_insn "neg2"
   [(set (match_operand:ANYF 0 "register_operand" "=f")
@@ -1353,14 +1353,39 @@ (define_insn "neg<mode>2"
 ;;
 
 (define_insn "3"
-  [(set (match_operand:GPR 0 "register_operand" "=r,r")
-	(any_bitwise:GPR (match_operand:GPR 1 "register_operand" "%r,r")
-			 (match_operand:GPR 2 "uns_arith_operand" "r,K")))]
+  [(set (match_operand:X 0 "register_operand" "=r,r")
+	(any_bitwise:X (match_operand:X 1 "register_operand" "%r,r")
+		   (match_operand:X 2 "uns_arith_operand" "r,K")))]
   ""
   "%i2\t%0,%1,%2"
   [(set_attr "type" "logical")

Re: [PATCH] aarch64/expr: Use ccmp when the outer expression is used twice [PR100942]

2023-12-12 Thread Andrew Pinski
On Tue, Dec 12, 2023 at 12:22 AM Andrew Pinski  wrote:
>
> Ccmp is not used if the result of the and/ior is used by both
> a GIMPLE_COND and a GIMPLE_ASSIGN. This improves the code generation
> here by using ccmp in this case.
> Two changes are required: first, we need to allow the outer statement's
> result to be used more than once.
> The second change is that during the expansion of the gimple, we need
> to try using ccmp. This is needed because we don't expand the ssa
> name of the lhs but rather expand directly from the gimple.
>
> A small note on the ccmp_4.c testcase: we should be able to do slightly
> better than with this patch, but it is one extra instruction compared to
> before.
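
A sketch of the shape being handled (illustrative only, not one of the new
testcases):

    void foo (void);
    int g (int a, int b)
    {
      int t = (a == 0) & (b > 5);  /* the and's result feeds an assignment ... */
      if (t)                       /* ... and a condition, so it has two uses */
        foo ();
      return t;
    }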

Just FYI, this pattern shows up a few times in GCC itself even.

Thanks,
Andrew Pinski

>
> Bootstraped and tested on aarch64-linux-gnu with no regressions.
>
> PR target/100942
>
> gcc/ChangeLog:
>
> * ccmp.cc (ccmp_candidate_p): Add outer argument.
> Allow if the outer is true and the lhs is used more
> than once.
> (expand_ccmp_expr): Update call to ccmp_candidate_p.
> * cfgexpand.cc (expand_gimple_stmt_1): Try using ccmp
> for binary assignments.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/aarch64/ccmp_3.c: New test.
> * gcc.target/aarch64/ccmp_4.c: New test.
>
> Signed-off-by: Andrew Pinski 
> ---
>  gcc/ccmp.cc   |  9 +++---
>  gcc/cfgexpand.cc  | 25 
>  gcc/testsuite/gcc.target/aarch64/ccmp_3.c | 20 +
>  gcc/testsuite/gcc.target/aarch64/ccmp_4.c | 35 +++
>  4 files changed, 85 insertions(+), 4 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/ccmp_3.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/ccmp_4.c
>
> diff --git a/gcc/ccmp.cc b/gcc/ccmp.cc
> index 1bd6fadea35..a274f8c3d53 100644
> --- a/gcc/ccmp.cc
> +++ b/gcc/ccmp.cc
> @@ -92,7 +92,7 @@ ccmp_tree_comparison_p (tree t, basic_block bb)
>
>  /* Check whether G is a potential conditional compare candidate.  */
>  static bool
> -ccmp_candidate_p (gimple *g)
> +ccmp_candidate_p (gimple *g, bool outer = false)
>  {
>tree lhs, op0, op1;
>gimple *gs0, *gs1;
> @@ -109,8 +109,9 @@ ccmp_candidate_p (gimple *g)
>lhs = gimple_assign_lhs (g);
>op0 = gimple_assign_rhs1 (g);
>op1 = gimple_assign_rhs2 (g);
> -  if ((TREE_CODE (op0) != SSA_NAME) || (TREE_CODE (op1) != SSA_NAME)
> -  || !has_single_use (lhs))
> +  if ((TREE_CODE (op0) != SSA_NAME) || (TREE_CODE (op1) != SSA_NAME))
> +return false;
> +  if (!outer && !has_single_use (lhs))
>  return false;
>
>bb = gimple_bb (g);
> @@ -284,7 +285,7 @@ expand_ccmp_expr (gimple *g, machine_mode mode)
>rtx_insn *last;
>rtx tmp;
>
> -  if (!ccmp_candidate_p (g))
> +  if (!ccmp_candidate_p (g, true))
>  return NULL_RTX;
>
>last = get_last_insn ();
> diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
> index b860be8bb77..0f9aad8e3eb 100644
> --- a/gcc/cfgexpand.cc
> +++ b/gcc/cfgexpand.cc
> @@ -74,6 +74,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "output.h"
>  #include "builtins.h"
>  #include "opts.h"
> +#include "ccmp.h"
>
>  /* Some systems use __main in a way incompatible with its use in gcc, in 
> these
> cases use the macros NAME__MAIN to give a quoted symbol and SYMBOL__MAIN 
> to
> @@ -3972,6 +3973,30 @@ expand_gimple_stmt_1 (gimple *stmt)
> if (GET_CODE (target) == SUBREG && SUBREG_PROMOTED_VAR_P (target))
>   promoted = true;
>
> +   /* Try to expand conditional compare.  */
> +   if (targetm.gen_ccmp_first
> +   && gimple_assign_rhs_class (assign_stmt) == GIMPLE_BINARY_RHS)
> + {
> +   machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
> +   gcc_checking_assert (targetm.gen_ccmp_next != NULL);
> +   temp = expand_ccmp_expr (stmt, mode);
> +   if (temp)
> + {
> +   if (promoted)
> + {
> +   int unsignedp = SUBREG_PROMOTED_SIGN (target);
> +   convert_move (SUBREG_REG (target), temp, unsignedp);
> + }
> +   else
> +{
> +   temp = force_operand (temp, target);
> +   if (temp != target)
> + emit_move_insn (target, temp);
> + }
> +   return;
> + }
> + }
> +
> ops.code = gimple_assign_rhs_code (assign_stmt);
> ops.type = TREE_TYPE (lhs);
> switch (get_gimple_rhs_class (ops.code))
> diff --git a/gcc/testsuite/gcc.target/aarch64/ccmp_3.c 
> b/gcc/testsuite/gcc.target/aarch64/ccmp_3.c
> new file mode 100644
> index 000..a2b47fbee14
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/ccmp_3.c
> @@ -0,0 +1,20 @@
> +/* { dg-options 

Re: [PATCH V4 1/3]rs6000: accurate num_insns_constant_gpr

2023-12-12 Thread Jiufu Guo


Hi,

"Kewen.Lin"  writes:

> Hi Jeff,
>
> on 2023/12/11 11:26, Jiufu Guo wrote:
>> Hi,
>> 
>> Trunk gcc supports more constants to be built via two instructions:
>> e.g. "li/lis; xori/xoris/rldicl/rldicr/rldic".
>> And then num_insns_constant should also be updated.
>> 
>> Function "rs6000_emit_set_long_const" is used to build complicated
>> constants; and "num_insns_constant_gpr" is used to compute "how
>> many instructions are needed" to build the constant. So, these 
>> two functions should be aligned.
>> 
>> The idea of this patch is: to reuse "rs6000_emit_set_long_const" to
>> compute/record the instruction number(when computing the insn_num, 
>> then do not emit instructions).
>> 
>> Compare with the previous version,
>> https://gcc.gnu.org/pipermail/gcc-patches/2023-December/639491.html
>> this version updates a lambda usage and comments.
>> 
>> Bootstrap & regtest pass ppc64{,le}.
>> Is this ok for trunk?
>
> OK for trunk, thanks for the patience.

Committed via r14-6476.
Thanks for your kind review and great comments!

BR,
Jeff (Jiufu Guo)

>
> BR,
> Kewen
>
>> 
>> BR,
>> Jeff (Jiufu Guo)
>> 
>> gcc/ChangeLog:
>> 
>>  * config/rs6000/rs6000.cc (rs6000_emit_set_long_const): Add new
>>  parameter to record number of instructions to build the constant.
>>  (num_insns_constant_gpr): Call rs6000_emit_set_long_const to compute
>>  num_insn.
>> 
>> ---
>>  gcc/config/rs6000/rs6000.cc | 284 ++--
>>  1 file changed, 146 insertions(+), 138 deletions(-)
>> 
>> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
>> index cee22c359f3..1e3d1f7fc08 100644
>> --- a/gcc/config/rs6000/rs6000.cc
>> +++ b/gcc/config/rs6000/rs6000.cc
>> @@ -1115,7 +1115,7 @@ static tree rs6000_handle_longcall_attribute (tree *, 
>> tree, tree, int, bool *);
>>  static tree rs6000_handle_altivec_attribute (tree *, tree, tree, int, bool 
>> *);
>>  static tree rs6000_handle_struct_attribute (tree *, tree, tree, int, bool 
>> *);
>>  static tree rs6000_builtin_vectorized_libmass (combined_fn, tree, tree);
>> -static void rs6000_emit_set_long_const (rtx, HOST_WIDE_INT);
>> +static void rs6000_emit_set_long_const (rtx, HOST_WIDE_INT, int * = 
>> nullptr);
>>  static int rs6000_memory_move_cost (machine_mode, reg_class_t, bool);
>>  static bool rs6000_debug_rtx_costs (rtx, machine_mode, int, int, int *, 
>> bool);
>>  static int rs6000_debug_address_cost (rtx, machine_mode, addr_space_t,
>> @@ -6054,21 +6054,9 @@ num_insns_constant_gpr (HOST_WIDE_INT value)
>>  
>>else if (TARGET_POWERPC64)
>>  {
>> -  HOST_WIDE_INT low = sext_hwi (value, 32);
>> -  HOST_WIDE_INT high = value >> 31;
>> -
>> -  if (high == 0 || high == -1)
>> -return 2;
>> -
>> -  high >>= 1;
>> -
>> -  if (low == 0 || low == high)
>> -return num_insns_constant_gpr (high) + 1;
>> -  else if (high == 0)
>> -return num_insns_constant_gpr (low) + 1;
>> -  else
>> -return (num_insns_constant_gpr (high)
>> -+ num_insns_constant_gpr (low) + 1);
>> +  int num_insns = 0;
>> +  rs6000_emit_set_long_const (nullptr, value, &num_insns);
>> +  return num_insns;
>>  }
>>  
>>else
>> @@ -10494,14 +10482,13 @@ can_be_built_by_li_and_rldic (HOST_WIDE_INT c, int 
>> *shift, HOST_WIDE_INT *mask)
>>  
>>  /* Subroutine of rs6000_emit_set_const, handling PowerPC64 DImode.
>> Output insns to set DEST equal to the constant C as a series of
>> -   lis, ori and shl instructions.  */
>> +   lis, ori and shl instructions.  If NUM_INSNS is not NULL, then
>> +   only increase *NUM_INSNS as the number of insns, and do not emit
>> +   any insns.  */
>>  
>>  static void
>> -rs6000_emit_set_long_const (rtx dest, HOST_WIDE_INT c)
>> +rs6000_emit_set_long_const (rtx dest, HOST_WIDE_INT c, int *num_insns)
>>  {
>> -  rtx temp;
>> -  int shift;
>> -  HOST_WIDE_INT mask;
>>HOST_WIDE_INT ud1, ud2, ud3, ud4;
>>  
>>ud1 = c & 0xffff;
>> @@ -10509,168 +10496,189 @@ rs6000_emit_set_long_const (rtx dest, 
>> HOST_WIDE_INT c)
>>ud3 = (c >> 32) & 0xffff;
>>ud4 = (c >> 48) & 0xffff;
>>  
>> -  if ((ud4 == 0xffff && ud3 == 0xffff && ud2 == 0xffff && (ud1 & 0x8000))
>> -      || (ud4 == 0 && ud3 == 0 && ud2 == 0 && ! (ud1 & 0x8000)))
>> -    emit_move_insn (dest, GEN_INT (sext_hwi (ud1, 16)));
>> +  /* This lambda is used to emit one insn or just increase the insn count.
>> + When counting the insn number, no need to emit the insn.  */
>> +  auto count_or_emit_insn = [&num_insns] (rtx dest_or_insn, rtx src = 
>> nullptr) {
>> +if (num_insns)
>> +  {
>> +(*num_insns)++;
>> +return;
>> +  }
>> +
>> +if (src)
>> +  emit_move_insn (dest_or_insn, src);
>> +else
>> +  emit_insn (dest_or_insn);
>> +  };
>>  
>> -  else if ((ud4 == 0x && ud3 == 0x && (ud2 & 0x8000))
>> -   || (ud4 == 0 && ud3 == 0 && ! (ud2 & 0x8000)))
>> +  if ((ud4 == 0x && ud3 == 0x && ud2 == 0x && (ud1 & 0x8000))
>> +  || (ud4 == 

Re: [PATCH V4 2/3] Using pli for constant splitting

2023-12-12 Thread Jiufu Guo


Hi,

"Kewen.Lin"  writes:

> Hi,
>
> on 2023/12/11 11:26, Jiufu Guo wrote:
>> Hi,
>> 
>> For constant building e.g. r120=0x66666666, which does not fit 'li or lis',
>> 'pli' is used to build this constant via 'emit_move_insn'.
>> 
>> While for a complicated constant, e.g. 0x6666666666666666ULL, when using
>> 'rs6000_emit_set_long_const' to split the constant recursively, it fails to
>> use 'pli' to build the half part constant: 0x66666666.
>> 
>> 'rs6000_emit_set_long_const' could be updated to use 'pli' to build half
>> part of the constant when necessary.  For example, for 0x6666666666666666ULL,
>> "pli 3,1717986918; rldimi 3,3,32,0" can be used.
>> 
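
For readers unfamiliar with rldimi, a rough sketch of what the quoted
two-instruction sequence computes (illustrative C++, not the emitted RTL):

    /* pli 3,1717986918    => r3 = 0x66666666 (a 34-bit signed immediate)
       rldimi 3,3,32,0     => r3 = (r3 << 32) | (r3 & 0xffffffff)  */
    unsigned long long build_msk66 (void)
    {
      unsigned long long lo = 0x66666666ULL;
      return (lo << 32) | lo;   /* 0x6666666666666666 */
    }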
>> Compare with previous:
>> https://gcc.gnu.org/pipermail/gcc-patches/2023-December/639492.html
>> This version is refreshed and the testcase name updated.
>> 
>> Bootstrap pass on ppc64{,le}.
>> Is this ok for trunk?
>
> OK for trunk, thanks!
Committed via r14-6475.

Thanks for your kind review and great comments!

BR,
Jeff (Jiufu Guo)

>
> BR,
> Kewen
>
>> 
>> BR,
>> Jeff (Jiufu Guo)
>> 
>> gcc/ChangeLog:
>> 
>>  * config/rs6000/rs6000.cc (rs6000_emit_set_long_const): Add code to use
>>  pli for 34bit constant.
>> 
>> gcc/testsuite/ChangeLog:
>> 
>>  * gcc.target/powerpc/const-build-1.c: New test.
>> 
>> ---
>>  gcc/config/rs6000/rs6000.cc  | 9 -
>>  gcc/testsuite/gcc.target/powerpc/const-build-1.c | 9 +
>>  2 files changed, 17 insertions(+), 1 deletion(-)
>>  create mode 100644 gcc/testsuite/gcc.target/powerpc/const-build-1.c
>> 
>> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
>> index 017000a4e02..531c40488b4 100644
>> --- a/gcc/config/rs6000/rs6000.cc
>> +++ b/gcc/config/rs6000/rs6000.cc
>> @@ -10511,7 +10511,14 @@ rs6000_emit_set_long_const (rtx dest, HOST_WIDE_INT 
>> c, int *num_insns)
>>emit_insn (dest_or_insn);
>>};
>> 
>> -  if ((ud4 == 0xffff && ud3 == 0xffff && ud2 == 0xffff && (ud1 & 0x8000))
>> +  if (TARGET_PREFIXED && SIGNED_INTEGER_34BIT_P (c))
>> +{
>> +  /* li/lis/pli */
>> +  count_or_emit_insn (dest, GEN_INT (c));
>> +  return;
>> +}
>> +
>> + if ((ud4 == 0xffff && ud3 == 0xffff && ud2 == 0xffff && (ud1 & 0x8000))
>>|| (ud4 == 0 && ud3 == 0 && ud2 == 0 && !(ud1 & 0x8000)))
>>  {
>>/* li */
>> diff --git a/gcc/testsuite/gcc.target/powerpc/const-build-1.c 
>> b/gcc/testsuite/gcc.target/powerpc/const-build-1.c
>> new file mode 100644
>> index 000..7e35f8c507f
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/powerpc/const-build-1.c
>> @@ -0,0 +1,9 @@
>> +/* { dg-do compile { target lp64 } } */
>> +/* { dg-options "-O2 -mdejagnu-cpu=power10" } */
>> +/* { dg-require-effective-target power10_ok } */
>> +
>> +unsigned long long msk66() { return 0x6666666666666666ULL; }
>> +
>> +/* { dg-final { scan-assembler-times {\mpli\M} 1 } } */
>> +/* { dg-final { scan-assembler-not {\mli\M} } } */
>> +/* { dg-final { scan-assembler-not {\mlis\M} } } */


Re: [PATCH] c++: Fix warmth propagation for member function templates

2023-12-12 Thread Jason Merrill

On 12/12/23 14:29, Jason Xu wrote:

Support was recently added for class-level warmth attributes that are
propagated to member functions. The current implementation ignores
member function templates and this patch fixes that.


Thanks!  I'm applying this variant of the patch:
From c762599f112aa3b3c35c6aaac5856560d9282eb0 Mon Sep 17 00:00:00 2001
From: Jason Merrill 
Date: Tue, 12 Dec 2023 14:41:39 -0500
Subject: [PATCH] c++: class hotness attribute and member template
To: gcc-patches@gcc.gnu.org

The FUNCTION_DECL check ignored member function templates.

gcc/cp/ChangeLog:

	* class.cc (propagate_class_warmth_attribute): Handle
	member templates.

gcc/testsuite/ChangeLog:

	* g++.dg/ext/attr-hotness.C: Add member templates.

Co-authored-by: Jason Xu 
---
 gcc/cp/class.cc |  4 ++--
 gcc/testsuite/g++.dg/ext/attr-hotness.C | 16 
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/gcc/cp/class.cc b/gcc/cp/class.cc
index 6fdb56abfb9..1954e0a5ed3 100644
--- a/gcc/cp/class.cc
+++ b/gcc/cp/class.cc
@@ -7805,8 +7805,8 @@ propagate_class_warmth_attribute (tree t)
 
   if (class_has_cold_attr || class_has_hot_attr)
 for (tree f = TYPE_FIELDS (t); f; f = DECL_CHAIN (f))
-  if (TREE_CODE (f) == FUNCTION_DECL)
-	maybe_propagate_warmth_attributes (f, t);
+  if (DECL_DECLARES_FUNCTION_P (f))
+	maybe_propagate_warmth_attributes (STRIP_TEMPLATE (f), t);
 }
 
 tree
diff --git a/gcc/testsuite/g++.dg/ext/attr-hotness.C b/gcc/testsuite/g++.dg/ext/attr-hotness.C
index f9a6930304d..24aa089ead3 100644
--- a/gcc/testsuite/g++.dg/ext/attr-hotness.C
+++ b/gcc/testsuite/g++.dg/ext/attr-hotness.C
@@ -2,15 +2,23 @@
 /* { dg-options "-O0 -Wattributes -fdump-tree-gimple" } */
 
 
-struct __attribute((cold)) A { __attribute((noinline, used)) void foo(void) { } };
+struct __attribute((cold)) A {
+  __attribute((noinline, used)) void foo(void) { }
+  template <class T> void bar() {}
+};
+template void A::bar<int>();
 
-struct __attribute((hot)) B { __attribute((noinline, used)) void foo(void) { } };
+struct __attribute((hot)) B {
+  __attribute((noinline, used)) void foo(void) { }
+  template <class T> void bar() {}
+};
+template void B::bar<int>();
 
 struct __attribute((hot, cold)) C { __attribute((noinline, used)) void foo(void) { } }; /* { dg-warning "ignoring attribute .cold. because it conflicts with attribute .hot." } */
 
 struct __attribute((cold, hot)) D { __attribute((noinline, used)) void foo(void) { } }; /* { dg-warning "ignoring attribute .hot. because it conflicts with attribute .cold." } */
 
 
-/* { dg-final { scan-tree-dump-times "cold" 2 "gimple" } } */
-/* { dg-final { scan-tree-dump-times "hot" 2 "gimple" } } */
+/* { dg-final { scan-tree-dump-times "cold" 3 "gimple" } } */
+/* { dg-final { scan-tree-dump-times "hot" 3 "gimple" } } */
 
-- 
2.39.3



Re: Disable FMADD in chains for Zen4 and generic

2023-12-12 Thread Hongtao Liu
On Tue, Dec 12, 2023 at 10:38 PM Jan Hubicka  wrote:
>
> Hi,
> this patch disables use of FMA in matrix multiplication loop for generic (for
> x86-64-v3) and zen4.  I tested this on zen4 and Xeon Gold 6212U.
>
> For Intel this is neutral both on the matrix multiplication microbenchmark
> (attached) and spec2k17 where the difference was within noise for Core.
>
> On core the micro-benchmark runs as follows:
>
> With FMA:
>
>578,500,241  cycles:u #3.645 GHz   
>   ( +-  0.12% )
>753,318,477  instructions:u   #1.30  insn per 
> cycle  ( +-  0.00% )
>125,417,701  branches:u   #  790.227 M/sec 
>   ( +-  0.00% )
>   0.159146 +- 0.000363 seconds time elapsed  ( +-  0.23% )
>
>
> No FMA:
>
>577,573,960  cycles:u #3.514 GHz   
>   ( +-  0.15% )
>878,318,479  instructions:u   #1.52  insn per 
> cycle  ( +-  0.00% )
>125,417,702  branches:u   #  763.035 M/sec 
>   ( +-  0.00% )
>   0.164734 +- 0.000321 seconds time elapsed  ( +-  0.19% )
>
> So the cycle count is unchanged and discrete multiply+add takes same time as 
> FMA.
>
> While on zen:
>
>
> With FMA:
>  484875179  cycles:u #3.599 GHz   
>( +-  0.05% )  (82.11%)
>  752031517  instructions:u   #1.55  insn per 
> cycle
>  125106525  branches:u   #  928.712 M/sec 
>( +-  0.03% )  (85.09%)
> 128356  branch-misses:u  #0.10% of all 
> branches  ( +-  0.06% )  (83.58%)
>
> No FMA:
>  375875209  cycles:u #3.592 GHz   
>( +-  0.08% )  (80.74%)
>  875725341  instructions:u   #2.33  insn per 
> cycle
>  124903825  branches:u   #1.194 G/sec 
>( +-  0.04% )  (84.59%)
>   0.105203 +- 0.000188 seconds time elapsed  ( +-  0.18% )
>
> The difference is that Cores understand that fmadd does not need
> all three operands to start the computation, while Zen cores don't.
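
Put differently, an illustrative sketch (not from the benchmark):

    /* Chained FMAs serialize on the accumulator, so on cores that wait
       for all three inputs each FMA pays the full latency; the split
       form lets the multiplies issue independently of acc:  */
    float acc_fma (float acc, float a0, float b0, float a1, float b1)
    {
      acc = __builtin_fmaf (a0, b0, acc);
      acc = __builtin_fmaf (a1, b1, acc);
      return acc;
    }
    float acc_split (float acc, float a0, float b0, float a1, float b1)
    {
      float t0 = a0 * b0, t1 = a1 * b1;   /* independent of acc */
      return (acc + t0) + t1;             /* only the adds chain */
    }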
>
> Since this seems to be a noticeable win on Zen and no loss on Core, it
> seems like a good default for generic.
>
> I plan to commit the patch next week if there are no complaints.
The generic part LGTM.(It's exactly what we proposed in [1])

[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637721.html
>
> Honza
>
> #include 
> #include 
>
> #define SIZE 1000
>
> float a[SIZE][SIZE];
> float b[SIZE][SIZE];
> float c[SIZE][SIZE];
>
> void init(void)
> {
>int i, j, k;
>for(i=0; i<SIZE; i++)
>{
>   for(j=0; j<SIZE; j++)
>   {
>  a[i][j] = (float)i + j;
>  b[i][j] = (float)i - j;
>  c[i][j] = 0.0f;
>   }
>}
> }
>
> void mult(void)
> {
>int i, j, k;
>
>for(i=0; i<SIZE; i++)
>{
>   for(j=0; j<SIZE; j++)
>   {
>  for(k=0; k<SIZE; k++)
>  {
> c[i][j] += a[i][k] * b[k][j];
>  }
>   }
>}
> }
>
> int main(void)
> {
>clock_t s, e;
>
>init();
>s=clock();
>mult();
>e=clock();
>printf("mult took %10d clocks\n", (int)(e-s));
>
>return 0;
>
> }
>
> * confg/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS, 
> X86_TUNE_AVOID_256FMA_CHAINS)
> Enable for znver4 and Core.
>
> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> index 43fa9e8fd6d..74b03cbcc60 100644
> --- a/gcc/config/i386/x86-tune.def
> +++ b/gcc/config/i386/x86-tune.def
> @@ -515,13 +515,13 @@ DEF_TUNE (X86_TUNE_USE_SCATTER_8PARTS, 
> "use_scatter_8parts",
>
>  /* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
> smaller FMA chain.  */
> -DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | 
> m_ZNVER2 | m_ZNVER3
> -  | m_YONGFENG)
> +DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | 
> m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> +  | m_YONGFENG | m_GENERIC)
>
>  /* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
> smaller FMA chain.  */
> -DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | 
> m_ZNVER3
> - | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM)
> +DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | 
> m_ZNVER3 | m_ZNVER4
> + | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM | m_GENERIC)
>
>  /* X86_TUNE_AVOID_512FMA_CHAINS: Avoid creating loops with tight 512bit or
> smaller FMA chain.  */



-- 
BR,
Hongtao


Re: [PATCH] c++: unifying constants vs their type [PR99186, PR104867]

2023-12-12 Thread Patrick Palka
On Tue, 12 Dec 2023, Patrick Palka wrote:

> Bootstrapped and regtested on x86_64-pc-linux-gnu, does this look OK
> for trunk?
> 
> -- >8 --
> 
> When unifying constants we need to generally treat constants of
> different types but same value as different, in light of auto template
> parameters.  This patch fixes this in a minimal way; it seems we could
> get away with just using template_args_equal here, as we do in the

or just cp_tree_equal for that matter, because ...

> default case, but that's a simplification we could look into during next
> stage 1.
> 
>   PR c++/99186
>   PR c++/104867
> 
> gcc/cp/ChangeLog:
> 
> >   * pt.cc (unify) <case INTEGER_CST>: Compare types as well.
> 
> gcc/testsuite/ChangeLog:
> 
>   * g++.dg/cpp1z/nontype-auto23.C: New test.
>   * g++.dg/cpp1z/nontype-auto24.C: New test.
> ---
>  gcc/cp/pt.cc|  2 ++
>  gcc/testsuite/g++.dg/cpp1z/nontype-auto23.C | 23 +
>  gcc/testsuite/g++.dg/cpp1z/nontype-auto24.C | 18 
>  3 files changed, 43 insertions(+)
>  create mode 100644 gcc/testsuite/g++.dg/cpp1z/nontype-auto23.C
>  create mode 100644 gcc/testsuite/g++.dg/cpp1z/nontype-auto24.C
> 
> diff --git a/gcc/cp/pt.cc b/gcc/cp/pt.cc
> index a8966e223f1..602dd02d29d 100644
> --- a/gcc/cp/pt.cc
> +++ b/gcc/cp/pt.cc
> @@ -24709,6 +24709,8 @@ unify (tree tparms, tree targs, tree parm, tree arg, 
> int strict,
>/* Type INTEGER_CST can come from ordinary constant template args.  */
>  case INTEGER_CST:
>  case REAL_CST:
> +  if (!same_type_p (TREE_TYPE (parm), TREE_TYPE (arg)))
> + return unify_template_argument_mismatch (explain_p, parm, arg);
>while (CONVERT_EXPR_P (arg))
>   arg = TREE_OPERAND (arg, 0);

... this while loop seems to be dead code.

>  
> diff --git a/gcc/testsuite/g++.dg/cpp1z/nontype-auto23.C 
> b/gcc/testsuite/g++.dg/cpp1z/nontype-auto23.C
> new file mode 100644
> index 000..467559ffdda
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/cpp1z/nontype-auto23.C
> @@ -0,0 +1,23 @@
> +// PR c++/99186
> +// { dg-do compile { target c++17 } }
> +
> +template<int I, class T, class... Rest>
> +struct tuple_impl : tuple_impl<I+1, Rest...> { };
> +
> +template<int I, class T>
> +struct tuple_impl<I, T> { };
> +
> +template<class T, class U>
> +struct tuple : tuple_impl<0, T, U> { };
> +
> +template<class T, int I, class... Rest>
> +void get(const tuple_impl<I, T, Rest...>&);
> +
> +template<auto V>
> +struct S;
> +
> +int main() {
> +   tuple<S<1>,S<1U>> x;
> +   get<S<1>>(x);
> +   get<S<1U>>(x);
> +}
> diff --git a/gcc/testsuite/g++.dg/cpp1z/nontype-auto24.C 
> b/gcc/testsuite/g++.dg/cpp1z/nontype-auto24.C
> new file mode 100644
> index 000..52e4c134ccd
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/cpp1z/nontype-auto24.C
> @@ -0,0 +1,18 @@
> +// PR c++/104867
> +// { dg-do compile { target c++17 } }
> +
> +enum class Foo { A1 };
> +
> +enum class Bar { B1 };
> +
> +template<auto V> struct enum_;
> +
> +template<class K, class V> struct list { };
> +
> +template<class V> void f(list<enum_<Foo::A1>, V>);
> +
> +struct enum_type_map : list<enum_<Foo::A1>, int>, list<enum_<Bar::B1>, double> { };
> +
> +int main() {
> +  f(enum_type_map());
> +}
> -- 
> 2.43.0.76.g1a87c842ec
> 
> 



[PATCH] libcpp: Fix macro expansion for argument of __has_include [PR110558]

2023-12-12 Thread Lewis Hyatt
Hello-

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110558

This is a small fix for the libcpp issue noted in the PR. Bootstrap +
regtest all languages on x86-64 Linux. Is it ok for trunk please?

Also, it's not a regression, having never worked since __has_include was
introduced in GCC 5, but FWIW the fix would backport fine to all branches
since then... so I think backport to 11,12,13 would make sense assuming the
patch is OK. Thanks!

-Lewis

-- >8 --

When the file name for a #include directive is the result of stringifying a
macro argument, libcpp needs to take some care to get the whitespace
correct; in particular stringify_arg() needs to see a CPP_PADDING token
between macro tokens so that it can figure out when to output space between
tokens. The CPP_PADDING tokens are not normally generated when handling a
preprocessor directive, but for #include-like directives, libcpp sets the
state variable pfile->state.directive_wants_padding to TRUE so that the
CPP_PADDING tokens will be output, and then everything works fine for
computed includes.

As the PR points out, things do not work fine for __has_include. Fix that by
setting the state variable the same as is done for #include.
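
The failing shape, in miniature (illustrative; the new test below is the
authoritative version):

    #define STR(x) #x
    /* Spaces around the macro argument previously disturbed the
       stringized file name seen by __has_include:  */
    #if __has_include(STR( stdio.h ))
    #include STR( stdio.h )
    #endif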

libcpp/ChangeLog:

PR preprocessor/110558
* macro.cc (builtin_has_include): Set
pfile->state.directive_wants_padding prior to lexing the
file name, in case it comes from macro expansion.

gcc/testsuite/ChangeLog:

PR preprocessor/110558
* c-c++-common/cpp/has-include-2.c: New test.
* c-c++-common/cpp/has-include-2.h: New test.
---
 libcpp/macro.cc|  3 +++
 gcc/testsuite/c-c++-common/cpp/has-include-2.c | 12 
 gcc/testsuite/c-c++-common/cpp/has-include-2.h |  1 +
 3 files changed, 16 insertions(+)
 create mode 100644 gcc/testsuite/c-c++-common/cpp/has-include-2.c
 create mode 100644 gcc/testsuite/c-c++-common/cpp/has-include-2.h

diff --git a/libcpp/macro.cc b/libcpp/macro.cc
index 6f24a9d6f3a..15140c60023 100644
--- a/libcpp/macro.cc
+++ b/libcpp/macro.cc
@@ -398,6 +398,8 @@ builtin_has_include (cpp_reader *pfile, cpp_hashnode *op, 
bool has_next)
   NODE_NAME (op));
 
   pfile->state.angled_headers = true;
+  const auto sav_padding = pfile->state.directive_wants_padding;
+  pfile->state.directive_wants_padding = true;
   const cpp_token *token = cpp_get_token_no_padding (pfile);
   bool paren = token->type == CPP_OPEN_PAREN;
   if (paren)
@@ -406,6 +408,7 @@ builtin_has_include (cpp_reader *pfile, cpp_hashnode *op, 
bool has_next)
 cpp_error (pfile, CPP_DL_ERROR,
   "missing '(' before \"%s\" operand", NODE_NAME (op));
   pfile->state.angled_headers = false;
+  pfile->state.directive_wants_padding = sav_padding;
 
   bool bracket = token->type != CPP_STRING;
   char *fname = NULL;
diff --git a/gcc/testsuite/c-c++-common/cpp/has-include-2.c 
b/gcc/testsuite/c-c++-common/cpp/has-include-2.c
new file mode 100644
index 000..5cd00cb3fb5
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/cpp/has-include-2.c
@@ -0,0 +1,12 @@
+/* PR preprocessor/110558 */
+/* { dg-do preprocess } */
+#define STRINGIZE(x) #x
+#define GET_INCLUDE(i) STRINGIZE(has-include-i.h)
+/* Spaces surrounding the macro args previously caused a problem for 
__has_include().  */
+#if __has_include(GET_INCLUDE(2)) && __has_include(GET_INCLUDE( 2)) && 
__has_include(GET_INCLUDE( 2 ))
+#include GET_INCLUDE(2)
+#include GET_INCLUDE( 2)
+#include GET_INCLUDE( 2 )
+#else
+#error "__has_include did not handle padding properly" /* { dg-bogus 
"__has_include" } */
+#endif
diff --git a/gcc/testsuite/c-c++-common/cpp/has-include-2.h 
b/gcc/testsuite/c-c++-common/cpp/has-include-2.h
new file mode 100644
index 000..57c402b32a8
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/cpp/has-include-2.h
@@ -0,0 +1 @@
+/* PR preprocessor/110558 */


Re: [PATCH GCC 1/1] testsuite: Support test execution timeout factor as a keyword

2023-12-12 Thread Jeff Law




On 12/12/23 07:04, Maciej W. Rozycki wrote:

Add support for the `dg-test-timeout-factor' keyword letting a test
case scale the wait timeout used for code execution, analogously to
`dg-timeout-factor' used for code compilation.  This is useful for
particularly slow test cases for which increasing the wait timeout
globally would be excessive.

gcc/testsuite/
* lib/timeout-dg.exp (dg-test-timeout-factor): New procedure.

OK
jeff


Re: [PATCH DejaGNU 1/1] Support per-test execution timeout factor

2023-12-12 Thread Jeff Law




On 12/12/23 07:04, Maciej W. Rozycki wrote:

Add support for the `test_timeout_factor' global variable letting a test
case scale the wait timeout used for code execution.  This is useful for
particularly slow test cases for which increasing the wait timeout
globally would be excessive.

* baseboards/qemu.exp (qemu_load): Handle `test_timeout_factor'.
* config/gdb-comm.exp (gdb_comm_load): Likewise.
* config/gdb_stub.exp (gdb_stub_load): Likewise.
* config/sim.exp (sim_load): Likewise.
* config/unix.exp (unix_load): Likewise.
* doc/dejagnu.texi (Local configuration file): Document
`test_timeout_factor'.

OK
jeff


Re: [PATCH] SRA: Force gimple operand in an additional corner case (PR 112822)

2023-12-12 Thread Peter Bergner
On 12/12/23 1:26 PM, Richard Biener wrote:
>> Am 12.12.2023 um 19:51 schrieb Peter Bergner :
>>
>> On 12/12/23 12:45 PM, Peter Bergner wrote:
>>> +/* PR target/112822 */
>>
>> Oops, this should be:
>>
>> /* PR tree-optimization/112822 */
>>
>> It's fixed on my end.
> 
> Ok

Pushed now that Martin has pushed his fix.  Thanks!

Peter




[PATCH v3] c++: fix ICE with sizeof in a template [PR112869]

2023-12-12 Thread Marek Polacek
On Fri, Dec 08, 2023 at 11:09:15PM -0500, Jason Merrill wrote:
> On 12/8/23 16:15, Marek Polacek wrote:
> > On Fri, Dec 08, 2023 at 12:09:18PM -0500, Jason Merrill wrote:
> > > On 12/5/23 15:31, Marek Polacek wrote:
> > > > Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk?
> > > > 
> > > > -- >8 --
> > > > This test shows that we cannot clear *walk_subtrees in
> > > > cp_fold_immediate_r when we're in_immediate_context, because that,
> > > > as the comment says, affects cp_fold_r as well.  Here we had an
> > > > expression with
> > > > 
> > > > min ((long int) VIEW_CONVERT_EXPR<unsigned long>(bytecount), 
> > > > (long int) <<< Unknown tree: sizeof_expr
> > > >   (int) <<< error >>> >>>)
> > > > 
> > > > as its sub-expression, and we never evaluated that into
> > > > 
> > > > min ((long int) bytecount, 4)
> > > > 
> > > > so the SIZEOF_EXPR leaked into the middle end.
> > > > 
> > > > (There's still one *walk_subtrees = 0; in cp_fold_immediate_r, but that
> > > > one should be OK.)
> > > > 
> > > > PR c++/112869
> > > > 
> > > > gcc/cp/ChangeLog:
> > > > 
> > > > * cp-gimplify.cc (cp_fold_immediate_r): Don't clear 
> > > > *walk_subtrees
> > > > for unevaluated operands.
> > > 
> > > I agree that we want this change for in_immediate_context (), but I don't
> > > see why we want it for TYPE_P or unevaluated_p (code) or
> > > cp_unevaluated_operand?
> > 
> > No particular reason, just paranoia.  How's this?
> > 
> > Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk?
> > 
> > -- >8 --
> > This test shows that we cannot clear *walk_subtrees in
> > cp_fold_immediate_r when we're in_immediate_context, because that,
> > as the comment says, affects cp_fold_r as well.  Here we had an
> > expression with
> > 
> >min ((long int) VIEW_CONVERT_EXPR<unsigned long>(bytecount), (long 
> > int) <<< Unknown tree: sizeof_expr
> >  (int) <<< error >>> >>>)
> > 
> > as its sub-expression, and we never evaluated that into
> > 
> >min ((long int) bytecount, 4)
> > 
> > so the SIZEOF_EXPR leaked into the middle end.
> > 
> > (There's still one *walk_subtrees = 0; in cp_fold_immediate_r, but that
> > one should be OK.)
> > 
> > PR c++/112869
> > 
> > gcc/cp/ChangeLog:
> > 
> > * cp-gimplify.cc (cp_fold_immediate_r): Don't clear *walk_subtrees
> > for in_immediate_context.
> > 
> > gcc/testsuite/ChangeLog:
> > 
> > * g++.dg/template/sizeof18.C: New test.
> > ---
> >   gcc/cp/cp-gimplify.cc| 6 +-
> >   gcc/testsuite/g++.dg/template/sizeof18.C | 8 
> >   2 files changed, 13 insertions(+), 1 deletion(-)
> >   create mode 100644 gcc/testsuite/g++.dg/template/sizeof18.C
> > 
> > diff --git a/gcc/cp/cp-gimplify.cc b/gcc/cp/cp-gimplify.cc
> > index 5abb91bbdd3..6af7c787372 100644
> > --- a/gcc/cp/cp-gimplify.cc
> > +++ b/gcc/cp/cp-gimplify.cc
> > @@ -1179,11 +1179,15 @@ cp_fold_immediate_r (tree *stmt_p, int 
> > *walk_subtrees, void *data_)
> > /* No need to look into types or unevaluated operands.
> >NB: This affects cp_fold_r as well.  */
> > -  if (TYPE_P (stmt) || unevaluated_p (code) || in_immediate_context ())
> > +  if (TYPE_P (stmt) || unevaluated_p (code))
> >   {
> > *walk_subtrees = 0;
> > return NULL_TREE;
> >   }
> > +  else if (in_immediate_context ())
> > +/* Don't clear *walk_subtrees here: we still need to walk the subtrees
> > +   of SIZEOF_EXPR and similar.  */
> > +return NULL_TREE;
> > tree decl = NULL_TREE;
> > bool call_p = false;
> > diff --git a/gcc/testsuite/g++.dg/template/sizeof18.C 
> > b/gcc/testsuite/g++.dg/template/sizeof18.C
> > new file mode 100644
> > index 000..afba9946258
> > --- /dev/null
> > +++ b/gcc/testsuite/g++.dg/template/sizeof18.C
> > @@ -0,0 +1,8 @@
> > +// PR c++/112869
> > +// { dg-do compile }
> > +
> > +void min(long, long);
> > +template <class T> void Binaryread(int &, T, unsigned long);
> > +template <> void Binaryread(int &, float, unsigned long bytecount) {
> > +  min(bytecount, sizeof(int));
> > +}
> 
> Hmm, actually, why does the above make a difference for this testcase?
> 
> ...
> 
> It seems that in_immediate_context always returns true in cp_fold_function
> because current_binding_level->kind == sk_template_parms.  That seems like a
> problem.  Maybe for cp_fold_immediate_r we only want to check
> cp_unevaluated_operand or DECL_IMMEDIATE_CONTEXT (current_function_decl)?

Yeah, I suppose that could become an issue.  How about this, then?

Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk?
-- >8 --
This test shows that we cannot clear *walk_subtrees in
cp_fold_immediate_r when we're in_immediate_context, because that,
as the comment says, affects cp_fold_r as well.  Here we had an
expression with

  min ((long int) VIEW_CONVERT_EXPR<unsigned long>(bytecount), (long int) 
<<< Unknown tree: sizeof_expr
(int) <<< error >>> >>>)

as its sub-expression, and we never evaluated that into

  min ((long int) bytecount, 4)

so the SIZEOF_EXPR 

[committed] libstdc++: Fix std::format("{}", 'c')

2023-12-12 Thread Jonathan Wakely
Tested x86_64-linux. Pushed to trunk.

-- >8--

When I added a fast path for std::format("{}", x) in
r14-5587-g41a5ea4cab2c59 I forgot to handle char separately from other
integral types. That caused std::format("{}", 'c') to return "99"
instead of "c".

libstdc++-v3/ChangeLog:

* include/std/format (__do_vformat_to): Handle char separately
from other integral types.
* testsuite/std/format/functions/format.cc: Check for expected
output for char and bool arguments.
* testsuite/std/format/string.cc: Check that 0 filling is
rejected for character and string formats.
---
 libstdc++-v3/include/std/format   |  9 +++
 .../testsuite/std/format/functions/format.cc  | 56 +++
 libstdc++-v3/testsuite/std/format/string.cc   |  3 +
 3 files changed, 68 insertions(+)

diff --git a/libstdc++-v3/include/std/format b/libstdc++-v3/include/std/format
index 04d03e0ceb7..1f8cd5c06be 100644
--- a/libstdc++-v3/include/std/format
+++ b/libstdc++-v3/include/std/format
@@ -3968,6 +3968,15 @@ namespace __format
  __done = true;
}
}
+ else if constexpr (is_same_v<_Tp, char>)
+   {
+ if (auto __res = __sink_out._M_reserve(1))
+   {
+ *__res.get() = __arg;
+ __res._M_bump(1);
+ __done = true;
+   }
+   }
  else if constexpr (is_integral_v<_Tp>)
{
  make_unsigned_t<_Tp> __uval;
diff --git a/libstdc++-v3/testsuite/std/format/functions/format.cc 
b/libstdc++-v3/testsuite/std/format/functions/format.cc
index 9328dec8875..b3b4f0647bc 100644
--- a/libstdc++-v3/testsuite/std/format/functions/format.cc
+++ b/libstdc++-v3/testsuite/std/format/functions/format.cc
@@ -256,12 +256,42 @@ test_width()
   }
 }
 
+void
+test_char()
+{
+  std::string s;
+
+  s = std::format("{}", 'a');
+  VERIFY( s == "a" );
+
+  s = std::format("{:c} {:d} {:o}", 'b', '\x17', '\x3f');
+  VERIFY( s == "b 23 77" );
+
+  s = std::format("{:#d} {:#o}", '\x17', '\x3f');
+  VERIFY( s == "23 077" );
+
+  s = std::format("{:04d} {:04o}", '\x17', '\x3f');
+  VERIFY( s == "0023 0077" );
+
+  s = std::format("{:b} {:B} {:#b} {:#B}", '\xff', '\xa0', '\x17', '\x3f');
+  if constexpr (std::is_unsigned_v<char>)
+    VERIFY( s == "11111111 10100000 0b10111 0B111111" );
+  else
+    VERIFY( s == "-1 -1100000 0b10111 0B111111" );
+
+  s = std::format("{:x} {:#x} {:#X}", '\x12', '\x34', '\x45');
+  VERIFY( s == "12 0x34 0X45" );
+}
+
 void
 test_wchar()
 {
   using namespace std::literals;
   std::wstring s;
 
+  s = std::format(L"{}", L'a');
+  VERIFY( s == L"a" );
+
   s = std::format(L"{} {} {} {} {} {}", L'0', 1, 2LL, 3.4, L"five", L"six"s);
   VERIFY( s == L"0 1 2 3.4 five six" );
 
@@ -353,6 +383,9 @@ test_pointer()
   const void* pc = p;
   std::string s, str_int;
 
+  s = std::format("{}", p);
+  VERIFY( s == "0x0" );
+
   s = std::format("{} {} {}", p, pc, nullptr);
   VERIFY( s == "0x0 0x0 0x0" );
   s = std::format("{:p} {:p} {:p}", p, pc, nullptr);
@@ -385,6 +418,27 @@ test_pointer()
 #endif
 }
 
+void
+test_bool()
+{
+  std::string s;
+
+  s = std::format("{}", true);
+  VERIFY( s == "true" );
+  s = std::format("{:} {:s}", true, false);
+  VERIFY( s == "true false" );
+  s = std::format("{:b} {:#b}", true, false);
+  VERIFY( s == "1 0b0" );
+  s = std::format("{:B} {:#B}", false, true);
+  VERIFY( s == "0 0B1" );
+  s = std::format("{:d} {:#d}", false, true);
+  VERIFY( s == "0 1" );
+  s = std::format("{:o} {:#o} {:#o}", false, true, false);
+  VERIFY( s == "0 01 0" );
+  s = std::format("{:x} {:#x} {:#X}", false, true, false);
+  VERIFY( s == "0 0x1 0X0" );
+}
+
 int main()
 {
   test_no_args();
@@ -393,8 +447,10 @@ int main()
   test_alternate_forms();
   test_locale();
   test_width();
+  test_char();
   test_wchar();
   test_minmax();
   test_p1652r1();
   test_pointer();
+  test_bool();
 }
diff --git a/libstdc++-v3/testsuite/std/format/string.cc 
b/libstdc++-v3/testsuite/std/format/string.cc
index 5d338644c62..40aaebae04e 100644
--- a/libstdc++-v3/testsuite/std/format/string.cc
+++ b/libstdc++-v3/testsuite/std/format/string.cc
@@ -109,6 +109,9 @@ test_format_spec()
   VERIFY( ! is_format_string_for("{:#?}", "str") );
   VERIFY( ! is_format_string_for("{:#?}", 'c') );
 
+  VERIFY( ! is_format_string_for("{:0c}", 'c') );
+  VERIFY( ! is_format_string_for("{:0s}", true) );
+
   // Precision only valid for string and floating-point types.
   VERIFY( ! is_format_string_for("{:.3d}", 1) );
   VERIFY( ! is_format_string_for("{:3.3d}", 1) );
-- 
2.43.0



[committed] libstdc++: Fix std::format output of %C for negative years

2023-12-12 Thread Jonathan Wakely
Tested x86_64-linux. Pushed to trunk.

-- >8--

During discussion of LWG 4022 I noticed that we do not correctly
implement floored division for the century. We were just truncating
towards zero, rather than applying the floor function. For negative
values, that rounds the wrong way.
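
For reference, a generic floored-division sketch (the patch itself adjusts
the magnitude after the sign is printed, but the arithmetic is the same):

    // floor division of a year by 100, given C++'s truncating '/':
    int century (int y)
    {
      int c = y / 100;            // truncates toward zero: -123/100 == -1
      if (y < 0 && c * 100 != y)
        --c;                      // floored: -123 -> -2, -100 -> -1
      return c;
    }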

libstdc++-v3/ChangeLog:

* include/bits/chrono_io.h (__formatter_chrono::_M_C_y_Y): Fix
rounding for negative centuries.
* testsuite/std/time/year/io.cc: Check %C for negative years.
---
 libstdc++-v3/include/bits/chrono_io.h  | 9 +++--
 libstdc++-v3/testsuite/std/time/year/io.cc | 7 +--
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/libstdc++-v3/include/bits/chrono_io.h 
b/libstdc++-v3/include/bits/chrono_io.h
index 16e8fc58dff..b63b8592eba 100644
--- a/libstdc++-v3/include/bits/chrono_io.h
+++ b/libstdc++-v3/include/bits/chrono_io.h
@@ -820,9 +820,14 @@ namespace __format
 
  if (__conv == 'Y' || __conv == 'C')
{
- if (__is_neg)
-   __s.assign(1, _S_plus_minus[1]);
  int __ci = __yi / 100;
+ if (__is_neg) [[unlikely]]
+   {
+ __s.assign(1, _S_plus_minus[1]);
+ // For floored division -123//100 is -2 and -100//100 is -1
+ if ((__ci * 100) != __yi)
+   ++__ci;
+   }
  if (__ci >= 100) [[unlikely]]
{
  __s += std::format(_S_empty_spec, __ci / 100);
diff --git a/libstdc++-v3/testsuite/std/time/year/io.cc 
b/libstdc++-v3/testsuite/std/time/year/io.cc
index 6157afae253..a6683ae20df 100644
--- a/libstdc++-v3/testsuite/std/time/year/io.cc
+++ b/libstdc++-v3/testsuite/std/time/year/io.cc
@@ -43,8 +43,11 @@ test_format()
   s = std::format("{}", --year::min()); // formatted via ostream
   VERIFY( s == "-32768 is not a valid year" );
 
-  s = std::format("{:%y} {:%y}", 1976y, -1976y);
-  VERIFY( s == "76 76" ); // LWG 3831
+  s = std::format("{:%C %y} {:%C %y}", 1976y, -1976y);
+  VERIFY( s == "19 76 -20 76" ); // LWG 3831
+
+  s = std::format("{:%C %y} {:%C %y} {:%C %y}", -9y, -900y, -555y);
+  VERIFY( s == "-01 09 -09 00 -06 55" ); // LWG 4022
 
   s = std::format("{0:%EC}{0:%Ey} = {0:%EY}", 1642y);
   VERIFY( s == "1642 = 1642" );
-- 
2.43.0



[committed] libstdc++: Remove redundant -std flags from Makefile

2023-12-12 Thread Jonathan Wakely
Tested x86_64-linux. Pushed to trunk.

-- >8--

In r14-4060-gc4baeaecbbf7d0 I moved some files from src/c++98 to
src/c++11 but I didn't remove the redundant -std=gnu++11 flags for those
files. The flags aren't needed now, because AM_CXXFLAGS for that
directory already uses -std=gnu++11. This removes them.

libstdc++-v3/ChangeLog:

* src/c++11/Makefile.am: Remove redundant -std=gnu++11 flags.
* src/c++11/Makefile.in: Regenerate.
---
 libstdc++-v3/src/c++11/Makefile.am | 8 
 libstdc++-v3/src/c++11/Makefile.in | 8 
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/libstdc++-v3/src/c++11/Makefile.am 
b/libstdc++-v3/src/c++11/Makefile.am
index 9cddb978928..b626e477dde 100644
--- a/libstdc++-v3/src/c++11/Makefile.am
+++ b/libstdc++-v3/src/c++11/Makefile.am
@@ -159,13 +159,13 @@ limits.lo: limits.cc
 limits.o: limits.cc
$(CXXCOMPILE) -fchar8_t -c $<
 locale_init.lo: locale_init.cc
-   $(LTCXXCOMPILE) -std=gnu++11 -fchar8_t -c $<
+   $(LTCXXCOMPILE) -fchar8_t -c $<
 locale_init.o: locale_init.cc
-   $(CXXCOMPILE) -std=gnu++11 -fchar8_t -c $<
+   $(CXXCOMPILE) -fchar8_t -c $<
 localename.lo: localename.cc
-   $(LTCXXCOMPILE) -std=gnu++11 -fchar8_t -c $<
+   $(LTCXXCOMPILE) -fchar8_t -c $<
 localename.o: localename.cc
-   $(CXXCOMPILE) -std=gnu++11 -fchar8_t -c $<
+   $(CXXCOMPILE) -fchar8_t -c $<
 
 if ENABLE_DUAL_ABI
 # Rewrite the type info for __ios_failure.
diff --git a/libstdc++-v3/src/c++11/Makefile.in 
b/libstdc++-v3/src/c++11/Makefile.in
index e6d37c5464c..4be021e8025 100644
--- a/libstdc++-v3/src/c++11/Makefile.in
+++ b/libstdc++-v3/src/c++11/Makefile.in
@@ -887,13 +887,13 @@ limits.lo: limits.cc
 limits.o: limits.cc
$(CXXCOMPILE) -fchar8_t -c $<
 locale_init.lo: locale_init.cc
-   $(LTCXXCOMPILE) -std=gnu++11 -fchar8_t -c $<
+   $(LTCXXCOMPILE) -fchar8_t -c $<
 locale_init.o: locale_init.cc
-   $(CXXCOMPILE) -std=gnu++11 -fchar8_t -c $<
+   $(CXXCOMPILE) -fchar8_t -c $<
 localename.lo: localename.cc
-   $(LTCXXCOMPILE) -std=gnu++11 -fchar8_t -c $<
+   $(LTCXXCOMPILE) -fchar8_t -c $<
 localename.o: localename.cc
-   $(CXXCOMPILE) -std=gnu++11 -fchar8_t -c $<
+   $(CXXCOMPILE) -fchar8_t -c $<
 
 @ENABLE_DUAL_ABI_TRUE@cxx11-ios_failure-lt.s: cxx11-ios_failure.cc
 @ENABLE_DUAL_ABI_TRUE@ $(LTCXXCOMPILE) -gno-as-loc-support -S $< -o 
tmp-cxx11-ios_failure-lt.s
-- 
2.43.0



[PATCH] btf: change encoding of forward-declared enums [PR111735]

2023-12-12 Thread David Faust
The BTF specification does not formally define a representation for
forward-declared enum types such as:

  enum Foo;

Forward-declarations for struct and union types are represented by
BTF_KIND_FWD, which has a 1-bit flag distinguishing the two.

The de-facto standard format used by other tools like clang and pahole
is to represent forward-declared enums as BTF_KIND_ENUM with vlen=0,
i.e. as a regular enum type with no enumerators.  This patch changes
GCC to adopt that format, and makes a couple of minor cleanups in
btf_asm_type ().
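
The construct at issue, for reference (illustrative):

    /* A forward-declared enum whose definition is never seen; with this
       patch it is emitted as BTF_KIND_ENUM with vlen=0 rather than
       BTF_KIND_FWD.  */
    enum Foo;
    extern enum Foo *pf;   /* a use that forces the type to be emitted */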

Bootstrapped and tested on x86_64-linux-gnu.
Also tested on x86_64-linux-gnu host for bpf-unknown-none target.

gcc/

PR debug/111735
* btfout.cc (btf_fwd_to_enum_p): New.
(btf_asm_type_ref): Special case references to enum forwards.
(btf_asm_type): Special case enum forwards. Rename btf_size_type to
btf_size, and change chained ifs switching on btf_kind into else ifs.

gcc/testsuite/

PR debug/111735
* gcc.dg/debug/btf/btf-forward-2.c: New test.
---
 gcc/btfout.cc | 46 ++-
 .../gcc.dg/debug/btf/btf-forward-2.c  | 18 
 2 files changed, 53 insertions(+), 11 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/debug/btf/btf-forward-2.c

diff --git a/gcc/btfout.cc b/gcc/btfout.cc
index db4f1084f85..3ec938874b6 100644
--- a/gcc/btfout.cc
+++ b/gcc/btfout.cc
@@ -268,6 +268,17 @@ btf_emit_id_p (ctf_id_t id)
  && (btf_id_map[id] <= BTF_MAX_TYPE));
 }
 
+/* Return true if DTD is a forward-declared enum.  The BTF representation
+   of forward declared enums is not formally defined.  */
+
+static bool
+btf_fwd_to_enum_p (ctf_dtdef_ref dtd)
+{
+  uint32_t btf_kind = get_btf_kind (CTF_V2_INFO_KIND 
(dtd->dtd_data.ctti_info));
+
+  return (btf_kind == BTF_KIND_FWD && dtd->dtd_data.ctti_type == CTF_K_ENUM);
+}
+
 /* Each BTF type can be followed additional, variable-length information
completing the description of the type. Calculate the number of bytes
of variable information required to encode a given type.  */
@@ -753,8 +764,12 @@ btf_asm_type_ref (const char *prefix, ctf_container_ref 
ctfc, ctf_id_t ref_id)
   uint32_t ref_kind
= get_btf_kind (CTF_V2_INFO_KIND (ref_type->dtd_data.ctti_info));
 
+  const char *kind_name = btf_fwd_to_enum_p (ref_type)
+   ? btf_kind_name (BTF_KIND_ENUM)
+   : btf_kind_name (ref_kind);
+
   dw2_asm_output_data (4, ref_id, "%s: (BTF_KIND_%s '%s')",
-  prefix, btf_kind_name (ref_kind),
+  prefix, kind_name,
   get_btf_type_name (ref_type));
 }
 }
@@ -765,11 +780,11 @@ btf_asm_type_ref (const char *prefix, ctf_container_ref 
ctfc, ctf_id_t ref_id)
 static void
 btf_asm_type (ctf_container_ref ctfc, ctf_dtdef_ref dtd)
 {
-  uint32_t btf_kind, btf_kflag, btf_vlen, btf_size_type;
+  uint32_t btf_kind, btf_kflag, btf_vlen, btf_size;
   uint32_t ctf_info = dtd->dtd_data.ctti_info;
 
   btf_kind = get_btf_kind (CTF_V2_INFO_KIND (ctf_info));
-  btf_size_type = dtd->dtd_data.ctti_type;
+  btf_size = dtd->dtd_data.ctti_size;
   btf_vlen = CTF_V2_INFO_VLEN (ctf_info);
 
   /* By now any unrepresentable types have been removed.  */
@@ -777,7 +792,7 @@ btf_asm_type (ctf_container_ref ctfc, ctf_dtdef_ref dtd)
 
   /* Size 0 integers are redundant definitions of void. None should remain
  in the types list by this point.  */
-  gcc_assert (btf_kind != BTF_KIND_INT || btf_size_type >= 1);
+  gcc_assert (btf_kind != BTF_KIND_INT || btf_size >= 1);
 
   /* Re-encode the ctti_info to BTF.  */
   /* kflag is 1 for structs/unions with a bitfield member.
@@ -810,16 +825,26 @@ btf_asm_type (ctf_container_ref ctfc, ctf_dtdef_ref dtd)
  structs and forwards to unions. The dwarf2ctf conversion process stores
  the kind of the forward in ctti_type, but for BTF this must be 0 for
  forwards, with only the KIND_FLAG to distinguish.
- At time of writing, BTF forwards to enums are unspecified.  */
-  if (btf_kind == BTF_KIND_FWD)
+ Forwards to enum types are special-cased below.  */
+  else if (btf_kind == BTF_KIND_FWD)
 {
   if (dtd->dtd_data.ctti_type == CTF_K_UNION)
btf_kflag = 1;
 
-  btf_size_type = 0;
+  /* PR debug/111735.  Encode forward-declared enums as BTF_KIND_ENUM
+     with vlen=0.  A representation for these is not formally defined;
+     this is the de-facto standard used by other tools like clang
+     and pahole.  */
+  else if (dtd->dtd_data.ctti_type == CTF_K_ENUM)
+   {
+ btf_kind = BTF_KIND_ENUM;
+ btf_vlen = 0;
+   }
+
+  btf_size = 0;
 }
 
-  if (btf_kind == BTF_KIND_ENUM)
+  else if (btf_kind == BTF_KIND_ENUM)
 {
   btf_kflag = dtd->dtd_enum_unsigned
? BTF_KF_ENUM_UNSIGNED
@@ -829,7 +854,7 @@ btf_asm_type (ctf_container_ref ctfc, ctf_dtdef_ref dtd)
}
 
   /* PR debug/112656.  BTF_KIND_FUNC_PROTO is 

Re: [PATCH] RISC-V: Apply vla vs. vls mode heuristic vector COST model

2023-12-12 Thread Robin Dapp
Given that it's almost verbatim aarch64's implementation and the
general approach appears sensible, LGTM.

Regards
 Robin



[PATCH] c++: unifying constants vs their type [PR99186, PR104867]

2023-12-12 Thread Patrick Palka
Bootstrapped and regtested on x86_64-pc-linux-gnu, does this look OK
for trunk?

-- >8 --

When unifying constants we need to generally treat constants of
different types but same value as different, in light of auto template
parameters.  This patch fixes this in a minimal way; it seems we could
get away with just using template_args_equal here, as we do in the
default case, but that's a simplification we could look into during next
stage 1.
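
For intuition, a minimal illustration (not part of the patch):

    #include <type_traits>
    template<auto V> struct S { };
    // 1 and 1U compare equal as values but deduce different types for V,
    // so these are distinct specializations and must not unify:
    static_assert(!std::is_same_v<S<1>, S<1U>>);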

PR c++/99186
PR c++/104867

gcc/cp/ChangeLog:

	* pt.cc (unify) <case INTEGER_CST>: Compare types as well.

gcc/testsuite/ChangeLog:

* g++.dg/cpp1z/nontype-auto23.C: New test.
* g++.dg/cpp1z/nontype-auto24.C: New test.
---
 gcc/cp/pt.cc|  2 ++
 gcc/testsuite/g++.dg/cpp1z/nontype-auto23.C | 23 +
 gcc/testsuite/g++.dg/cpp1z/nontype-auto24.C | 18 
 3 files changed, 43 insertions(+)
 create mode 100644 gcc/testsuite/g++.dg/cpp1z/nontype-auto23.C
 create mode 100644 gcc/testsuite/g++.dg/cpp1z/nontype-auto24.C

diff --git a/gcc/cp/pt.cc b/gcc/cp/pt.cc
index a8966e223f1..602dd02d29d 100644
--- a/gcc/cp/pt.cc
+++ b/gcc/cp/pt.cc
@@ -24709,6 +24709,8 @@ unify (tree tparms, tree targs, tree parm, tree arg, 
int strict,
   /* Type INTEGER_CST can come from ordinary constant template args.  */
 case INTEGER_CST:
 case REAL_CST:
+  if (!same_type_p (TREE_TYPE (parm), TREE_TYPE (arg)))
+   return unify_template_argument_mismatch (explain_p, parm, arg);
   while (CONVERT_EXPR_P (arg))
arg = TREE_OPERAND (arg, 0);
 
diff --git a/gcc/testsuite/g++.dg/cpp1z/nontype-auto23.C 
b/gcc/testsuite/g++.dg/cpp1z/nontype-auto23.C
new file mode 100644
index 000..467559ffdda
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp1z/nontype-auto23.C
@@ -0,0 +1,23 @@
+// PR c++/99186
+// { dg-do compile { target c++17 } }
+
+template<int I, class T, class... Rest>
+struct tuple_impl : tuple_impl<I+1, Rest...> { };
+
+template<int I, class T>
+struct tuple_impl<I, T> { };
+
+template<class T, class U>
+struct tuple : tuple_impl<0, T, U> { };
+
+template<class T, int I, class... Rest>
+void get(const tuple_impl<I, T, Rest...>&);
+
+template<auto V>
+struct S;
+
+int main() {
+   tuple<S<1>,S<1U>> x;
+   get<S<1>>(x);
+   get<S<1U>>(x);
+}
diff --git a/gcc/testsuite/g++.dg/cpp1z/nontype-auto24.C 
b/gcc/testsuite/g++.dg/cpp1z/nontype-auto24.C
new file mode 100644
index 000..52e4c134ccd
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp1z/nontype-auto24.C
@@ -0,0 +1,18 @@
+// PR c++/104867
+// { dg-do compile { target c++17 } }
+
+enum class Foo { A1 };
+
+enum class Bar { B1 };
+
+template<auto V> struct enum_;
+
+template<class K, class V> struct list { };
+
+template<class V> void f(list<enum_<Foo::A1>, V>);
+
+struct enum_type_map : list<enum_<Foo::A1>, int>, list<enum_<Bar::B1>, double> { };
+
+int main() {
+  f(enum_type_map());
+}
-- 
2.43.0.76.g1a87c842ec



Re: [PATCH] c++: Fix warmth propagation for member function templates

2023-12-12 Thread Marek Polacek
On Tue, Dec 12, 2023 at 07:29:40PM +, Jason Xu wrote:
> Support was recently added for class-level warmth attributes that are
> propagated to member functions. The current implementation ignores
> member function templates and this patch fixes that.

Thanks for the patch.  Is there a bug in the Bugzilla for this?
 
> gcc/cp/ChangeLog:
> 
> * class.cc (propagate_class_warmth_attribute): fix warmth
>   propagation for member function templates

Nit, but s/fix/Fix/, and add a full stop at the end.

> ---
>  gcc/cp/class.cc | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/gcc/cp/class.cc b/gcc/cp/class.cc
> index 6fdb56abfb9..68e0f2e9e13 100644
> --- a/gcc/cp/class.cc
> +++ b/gcc/cp/class.cc
> @@ -7805,8 +7805,13 @@ propagate_class_warmth_attribute (tree t)
> 
>if (class_has_cold_attr || class_has_hot_attr)
>  for (tree f = TYPE_FIELDS (t); f; f = DECL_CHAIN (f))
> -  if (TREE_CODE (f) == FUNCTION_DECL)
> -maybe_propagate_warmth_attributes (f, t);
> +  {
> +tree real_f = f;
> +if (TREE_CODE (f) == TEMPLATE_DECL)
> +  real_f = DECL_TEMPLATE_RESULT (f);
> +if (TREE_CODE (real_f) == FUNCTION_DECL)
> +  maybe_propagate_warmth_attributes (real_f, t);
> +  }

Don't you want just:

--- a/gcc/cp/class.cc
+++ b/gcc/cp/class.cc
@@ -7805,7 +7805,7 @@ propagate_class_warmth_attribute (tree t)

   if (class_has_cold_attr || class_has_hot_attr)
 for (tree f = TYPE_FIELDS (t); f; f = DECL_CHAIN (f))
-  if (TREE_CODE (f) == FUNCTION_DECL)
+  if (TREE_CODE (STRIP_TEMPLATE (f)) == FUNCTION_DECL)
maybe_propagate_warmth_attributes (f, t);
 }


Also, can you add a test for this?

Marek



[PATCH v4 3/3] RISC-V: Add support for XCVbi extension in CV32E40P

2023-12-12 Thread Mary Bennett
Spec: 
github.com/openhwgroup/core-v-sw/blob/master/specifications/corev-builtin-spec.md

Contributors:
  Mary Bennett 
  Nandni Jamnadas 
  Pietra Ferreira 
  Charlie Keaney
  Jessica Mills
  Craig Blackmore 
  Simon Cook 
  Jeremy Bennett 
  Helene Chelin 

gcc/ChangeLog:
* common/config/riscv/riscv-common.cc: Create XCVbi extension
  support.
* config/riscv/riscv.opt: Likewise.
* config/riscv/corev.md: Implement cv_branch pattern
  for cv.beqimm and cv.bneimm.
* config/riscv/riscv.md: Add CORE-V branch immediate to RISC-V
  branch instruction pattern.
* config/riscv/constraints.md: Implement constraints
  cv_bi_s5 - signed 5-bit immediate.
* config/riscv/predicates.md: Implement predicate
  const_int5s_operand - signed 5 bit immediate.
* doc/sourcebuild.texi: Add XCVbi documentation.

gcc/testsuite/ChangeLog:
* gcc.target/riscv/cv-bi-beqimm-compile-1.c: New test.
* gcc.target/riscv/cv-bi-beqimm-compile-2.c: New test.
* gcc.target/riscv/cv-bi-bneimm-compile-1.c: New test.
* gcc.target/riscv/cv-bi-bneimm-compile-2.c: New test.
* lib/target-supports.exp: Add proc for XCVbi.
---
 gcc/common/config/riscv/riscv-common.cc   |  2 +
 gcc/config/riscv/constraints.md   |  6 +++
 gcc/config/riscv/corev.md | 32 +
 gcc/config/riscv/predicates.md|  4 ++
 gcc/config/riscv/riscv.md |  2 +-
 gcc/config/riscv/riscv.opt|  2 +
 gcc/doc/sourcebuild.texi  |  3 ++
 .../gcc.target/riscv/cv-bi-beqimm-compile-1.c | 17 +++
 .../gcc.target/riscv/cv-bi-beqimm-compile-2.c | 48 +++
 .../gcc.target/riscv/cv-bi-bneimm-compile-1.c | 17 +++
 .../gcc.target/riscv/cv-bi-bneimm-compile-2.c | 48 +++
 gcc/testsuite/lib/target-supports.exp | 13 +
 12 files changed, 193 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/cv-bi-beqimm-compile-1.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/cv-bi-beqimm-compile-2.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/cv-bi-bneimm-compile-1.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/cv-bi-bneimm-compile-2.c

diff --git a/gcc/common/config/riscv/riscv-common.cc 
b/gcc/common/config/riscv/riscv-common.cc
index c8c0d0a2252..125f8fb71f7 100644
--- a/gcc/common/config/riscv/riscv-common.cc
+++ b/gcc/common/config/riscv/riscv-common.cc
@@ -313,6 +313,7 @@ static const struct riscv_ext_version 
riscv_ext_version_table[] =
   {"xcvmac", ISA_SPEC_CLASS_NONE, 1, 0},
   {"xcvalu", ISA_SPEC_CLASS_NONE, 1, 0},
   {"xcvelw", ISA_SPEC_CLASS_NONE, 1, 0},
+  {"xcvbi", ISA_SPEC_CLASS_NONE, 1, 0},
 
   {"xtheadba", ISA_SPEC_CLASS_NONE, 1, 0},
   {"xtheadbb", ISA_SPEC_CLASS_NONE, 1, 0},
@@ -1678,6 +1679,7 @@ static const riscv_ext_flag_table_t 
riscv_ext_flag_table[] =
   {"xcvmac",_options::x_riscv_xcv_subext, MASK_XCVMAC},
   {"xcvalu",_options::x_riscv_xcv_subext, MASK_XCVALU},
   {"xcvelw",_options::x_riscv_xcv_subext, MASK_XCVELW},
+  {"xcvbi", _options::x_riscv_xcv_subext, MASK_XCVBI},
 
   {"xtheadba",  _options::x_riscv_xthead_subext, MASK_XTHEADBA},
   {"xtheadbb",  _options::x_riscv_xthead_subext, MASK_XTHEADBB},
diff --git a/gcc/config/riscv/constraints.md b/gcc/config/riscv/constraints.md
index 2711efe68c5..718b4bd77df 100644
--- a/gcc/config/riscv/constraints.md
+++ b/gcc/config/riscv/constraints.md
@@ -247,3 +247,9 @@
   (and (match_code "const_int")
(and (match_test "IN_RANGE (ival, 0, 1073741823)")
 (match_test "exact_log2 (ival + 1) != -1"
+
+(define_constraint "CV_bi_sign5"
+  "@internal
+   A 5-bit signed immediate for CORE-V Immediate Branch."
+  (and (match_code "const_int")
+   (match_test "IN_RANGE (ival, -16, 15)")))
diff --git a/gcc/config/riscv/corev.md b/gcc/config/riscv/corev.md
index 92bf0b5d6a6..92e30a8ae04 100644
--- a/gcc/config/riscv/corev.md
+++ b/gcc/config/riscv/corev.md
@@ -706,3 +706,35 @@
 
   [(set_attr "type" "load")
   (set_attr "mode" "SI")])
+
+;; XCVBI Instructions
+(define_insn "cv_branch"
+  [(set (pc)
+   (if_then_else
+(match_operator 1 "equality_operator"
+[(match_operand:X 2 "register_operand" "r")
+ (match_operand:X 3 "const_int5s_operand" 
"CV_bi_sign5")])
+(label_ref (match_operand 0 "" ""))
+(pc)))]
+  "TARGET_XCVBI"
+  "cv.b%C1imm\t%2,%3,%0"
+  [(set_attr "type" "branch")
+   (set_attr "mode" "none")])
+
+(define_insn "*branch"
+  [(set (pc)
+(if_then_else
+ (match_operator 1 "ordered_comparison_operator"
+ [(match_operand:X 2 "register_operand" "r")
+  (match_operand:X 3 "reg_or_0_operand" "rJ")])
+ (label_ref (match_operand 0 "" ""))
+ (pc)))]
+  "TARGET_XCVBI"
+{
+  if (get_attr_length 

Re: [PATCH] c++: unifying FUNCTION_DECLs [PR93740]

2023-12-12 Thread Jason Merrill

On 12/12/23 13:40, Patrick Palka wrote:

Bootstrapped and regtested on x86_64-pc-linux-gnu, does this look
OK for trunk?


OK.


I considered removing the is_overloaded_fn test now as
well, but it could in theory be hit (and not subsumed by the
type_unknown_p test) for e.g. OVERLOAD of a single FUNCTION_DECL.  I
wonder if that's something we'd see here?  If not, I can remove the
test.  It seems safe to remove as far as the testsuite is concerned.


Next stage 1, sure.


-- >8 --

unify currently always returns success when unifying two FUNCTION_DECLs
(due to the is_overloaded_fn deferment within the default case), which
means for the below testcase unify incorrectly matches ::foo with
::bar, which leads to deduction failure for the index_of calls due to
a bogus base class ambiguity.

This patch makes us instead handle unification of FUNCTION_DECL like
other decls, i.e. according to their identity.

PR c++/93740

gcc/cp/ChangeLog:

* pt.cc (unify) : Handle it like FIELD_DECL
and TEMPLATE_DECL.

gcc/testsuite/ChangeLog:

* g++.dg/template/ptrmem34.C: New test.
---
  gcc/cp/pt.cc |  1 +
  gcc/testsuite/g++.dg/template/ptrmem34.C | 27 
  2 files changed, 28 insertions(+)
  create mode 100644 gcc/testsuite/g++.dg/template/ptrmem34.C

diff --git a/gcc/cp/pt.cc b/gcc/cp/pt.cc
index c2ddbff405b..a8966e223f1 100644
--- a/gcc/cp/pt.cc
+++ b/gcc/cp/pt.cc
@@ -24967,6 +24967,7 @@ unify (tree tparms, tree targs, tree parm, tree arg, 
int strict,
gcc_unreachable ();
  
  case FIELD_DECL:

+case FUNCTION_DECL:
  case TEMPLATE_DECL:
/* Matched cases are handled by the ARG == PARM test above.  */
return unify_template_argument_mismatch (explain_p, parm, arg);
diff --git a/gcc/testsuite/g++.dg/template/ptrmem34.C 
b/gcc/testsuite/g++.dg/template/ptrmem34.C
new file mode 100644
index 000..c349ca55639
--- /dev/null
+++ b/gcc/testsuite/g++.dg/template/ptrmem34.C
@@ -0,0 +1,27 @@
+// PR c++/93740
+// { dg-do compile { target c++11 } }
+
+struct A {
+  void foo();
+  void bar();
+};
+
+template 
+struct const_val{};
+
+template 
+struct indexed_elem{};
+
+using mem_fun_A_foo = const_val;
+using mem_fun_A_bar = const_val;
+
+struct A_indexed_member_funcs
+  : indexed_elem<0, mem_fun_A_foo>,
+indexed_elem<1, mem_fun_A_bar>
+{};
+
+template 
+constexpr int index_of(indexed_elem) { return N; }
+
+static_assert(index_of(A_indexed_member_funcs{}) == 0, "");
+static_assert(index_of(A_indexed_member_funcs{}) == 1, "");




[PATCH v4 2/3] RISC-V: Update XCValu constraints to match other vendors

2023-12-12 Thread Mary Bennett
gcc/ChangeLog:
* config/riscv/constraints.md: CVP2 -> CV_alu_pow2.
* config/riscv/corev.md: Likewise.
---
 gcc/config/riscv/constraints.md | 15 ---
 gcc/config/riscv/corev.md   |  4 ++--
 2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/gcc/config/riscv/constraints.md b/gcc/config/riscv/constraints.md
index 68be4515c04..2711efe68c5 100644
--- a/gcc/config/riscv/constraints.md
+++ b/gcc/config/riscv/constraints.md
@@ -151,13 +151,6 @@
 (define_register_constraint "zmvr" "(TARGET_ZFA || TARGET_XTHEADFMV) ? GR_REGS 
: NO_REGS"
   "An integer register for  ZFA or XTheadFmv.")
 
-;; CORE-V Constraints
-(define_constraint "CVP2"
-  "Checking for CORE-V ALU clip if ival plus 1 is a power of 2"
-  (and (match_code "const_int")
-   (and (match_test "IN_RANGE (ival, 0, 1073741823)")
-(match_test "exact_log2 (ival + 1) != -1"
-
 ;; Vector constraints.
 
 (define_register_constraint "vr" "TARGET_VECTOR ? V_REGS : NO_REGS"
@@ -246,3 +239,11 @@
A MEM with a valid address for th.[l|s]*ur* instructions."
   (and (match_code "mem")
(match_test "th_memidx_legitimate_index_p (op, true)")))
+
+;; CORE-V Constraints
+(define_constraint "CV_alu_pow2"
+  "@internal
+   Checking for CORE-V ALU clip if ival plus 1 is a power of 2"
+  (and (match_code "const_int")
+   (and (match_test "IN_RANGE (ival, 0, 1073741823)")
+(match_test "exact_log2 (ival + 1) != -1"
diff --git a/gcc/config/riscv/corev.md b/gcc/config/riscv/corev.md
index c7a2ba07bcc..92bf0b5d6a6 100644
--- a/gcc/config/riscv/corev.md
+++ b/gcc/config/riscv/corev.md
@@ -516,7 +516,7 @@
 (define_insn "riscv_cv_alu_clip"
   [(set (match_operand:SI 0 "register_operand" "=r,r")
(unspec:SI [(match_operand:SI 1 "register_operand" "r,r")
-   (match_operand:SI 2 "immediate_register_operand" "CVP2,r")]
+   (match_operand:SI 2 "immediate_register_operand" 
"CV_alu_pow2,r")]
 UNSPEC_CV_ALU_CLIP))]
 
   "TARGET_XCVALU && !TARGET_64BIT"
@@ -529,7 +529,7 @@
 (define_insn "riscv_cv_alu_clipu"
   [(set (match_operand:SI 0 "register_operand" "=r,r")
(unspec:SI [(match_operand:SI 1 "register_operand" "r,r")
-   (match_operand:SI 2 "immediate_register_operand" "CVP2,r")]
+   (match_operand:SI 2 "immediate_register_operand" 
"CV_alu_pow2,r")]
 UNSPEC_CV_ALU_CLIPU))]
 
   "TARGET_XCVALU && !TARGET_64BIT"
-- 
2.34.1



[PATCH v4 0/3] RISC-V: Support CORE-V XCVELW and XCVBI extensions

2023-12-12 Thread Mary Bennett
Thank you for reviewing my patches!

v1 -> v2:
  * Bring the MEM into the operand for cv.elw. The new predicate is
move_operand.
  * Add comment to riscv.md detailing why corev.md must appear before
the generic riscv instructions.

v2 -> v3:
  * Merge patterns for CORE-V branch immediate and generic RISC-V so to
supress the generic patterns if XCVbi is available.

v3 -> v4:
  * Add duplicate content of "*branch" to corev.md.

This patch series presents the comprehensive implementation of the ELW and BI
extension for CORE-V.

Tested with riscv-gnu-toolchain on binutils, ld, gas and gcc testsuites to
ensure its correctness and compatibility with the existing codebase.
However, your input, reviews, and suggestions are invaluable in making this
extension even more robust.

The CORE-V builtins are described in the specification [1] and work can be
found in the OpenHW group's Github repository [2].

[1] 
github.com/openhwgroup/core-v-sw/blob/master/specifications/corev-builtin-spec.md

[2] github.com/openhwgroup/corev-gcc

Contributors:
  Mary Bennett 
  Nandni Jamnadas 
  Pietra Ferreira 
  Charlie Keaney
  Jessica Mills
  Craig Blackmore 
  Simon Cook 
  Jeremy Bennett 
  Helene Chelin 

RISC-V: Update XCValu constraints to match other vendors
RISC-V: Add support for XCVelw extension in CV32E40P
RISC-V: Add support for XCVbi extension in CV32E40P

 gcc/common/config/riscv/riscv-common.cc   |  4 ++
 gcc/config/riscv/constraints.md   | 21 +---
 gcc/config/riscv/corev.def|  3 ++
 gcc/config/riscv/corev.md | 51 ++-
 gcc/config/riscv/predicates.md|  4 ++
 gcc/config/riscv/riscv-builtins.cc|  2 +
 gcc/config/riscv/riscv-ftypes.def |  1 +
 gcc/config/riscv/riscv.md |  2 +-
 gcc/config/riscv/riscv.opt|  4 ++
 gcc/doc/extend.texi   |  8 +++
 gcc/doc/sourcebuild.texi  |  6 +++
 .../gcc.target/riscv/cv-bi-beqimm-compile-1.c | 17 +++
 .../gcc.target/riscv/cv-bi-beqimm-compile-2.c | 48 +
 .../gcc.target/riscv/cv-bi-bneimm-compile-1.c | 17 +++
 .../gcc.target/riscv/cv-bi-bneimm-compile-2.c | 48 +
 .../gcc.target/riscv/cv-elw-elw-compile-1.c   | 11 
 gcc/testsuite/lib/target-supports.exp | 26 ++
 17 files changed, 263 insertions(+), 10 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/cv-bi-beqimm-compile-1.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/cv-bi-beqimm-compile-2.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/cv-bi-bneimm-compile-1.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/cv-bi-bneimm-compile-2.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/cv-elw-elw-compile-1.c

-- 
2.34.1



[PATCH v4 1/3] RISC-V: Add support for XCVelw extension in CV32E40P

2023-12-12 Thread Mary Bennett
Spec: 
github.com/openhwgroup/core-v-sw/blob/master/specifications/corev-builtin-spec.md

Contributors:
  Mary Bennett 
  Nandni Jamnadas 
  Pietra Ferreira 
  Charlie Keaney
  Jessica Mills
  Craig Blackmore 
  Simon Cook 
  Jeremy Bennett 
  Helene Chelin 

gcc/ChangeLog:
* common/config/riscv/riscv-common.cc: Add XCVelw.
* config/riscv/corev.def: Likewise.
* config/riscv/corev.md: Likewise.
* config/riscv/riscv-builtins.cc (AVAIL): Likewise.
* config/riscv/riscv-ftypes.def: Likewise.
* config/riscv/riscv.opt: Likewise.
* doc/extend.texi: Add XCVelw builtin documentation.
* doc/sourcebuild.texi: Likewise.

gcc/testsuite/ChangeLog:
* gcc.target/riscv/cv-elw-compile-1.c: Create test for cv.elw.
* testsuite/lib/target-supports.exp: Add proc for the XCVelw extension.
---
 gcc/common/config/riscv/riscv-common.cc   |  2 ++
 gcc/config/riscv/corev.def|  3 +++
 gcc/config/riscv/corev.md | 15 +++
 gcc/config/riscv/riscv-builtins.cc|  2 ++
 gcc/config/riscv/riscv-ftypes.def |  1 +
 gcc/config/riscv/riscv.opt|  2 ++
 gcc/doc/extend.texi   |  8 
 gcc/doc/sourcebuild.texi  |  3 +++
 .../gcc.target/riscv/cv-elw-elw-compile-1.c   | 11 +++
 gcc/testsuite/lib/target-supports.exp | 13 +
 10 files changed, 60 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/riscv/cv-elw-elw-compile-1.c

diff --git a/gcc/common/config/riscv/riscv-common.cc 
b/gcc/common/config/riscv/riscv-common.cc
index 5111626157b..c8c0d0a2252 100644
--- a/gcc/common/config/riscv/riscv-common.cc
+++ b/gcc/common/config/riscv/riscv-common.cc
@@ -312,6 +312,7 @@ static const struct riscv_ext_version 
riscv_ext_version_table[] =
 
   {"xcvmac", ISA_SPEC_CLASS_NONE, 1, 0},
   {"xcvalu", ISA_SPEC_CLASS_NONE, 1, 0},
+  {"xcvelw", ISA_SPEC_CLASS_NONE, 1, 0},
 
   {"xtheadba", ISA_SPEC_CLASS_NONE, 1, 0},
   {"xtheadbb", ISA_SPEC_CLASS_NONE, 1, 0},
@@ -1676,6 +1677,7 @@ static const riscv_ext_flag_table_t 
riscv_ext_flag_table[] =
 
   {"xcvmac",_options::x_riscv_xcv_subext, MASK_XCVMAC},
   {"xcvalu",_options::x_riscv_xcv_subext, MASK_XCVALU},
+  {"xcvelw",_options::x_riscv_xcv_subext, MASK_XCVELW},
 
   {"xtheadba",  _options::x_riscv_xthead_subext, MASK_XTHEADBA},
   {"xtheadbb",  _options::x_riscv_xthead_subext, MASK_XTHEADBB},
diff --git a/gcc/config/riscv/corev.def b/gcc/config/riscv/corev.def
index 17580df3c41..3b9ec029d06 100644
--- a/gcc/config/riscv/corev.def
+++ b/gcc/config/riscv/corev.def
@@ -41,3 +41,6 @@ RISCV_BUILTIN (cv_alu_subN, "cv_alu_subN", 
RISCV_BUILTIN_DIRECT, RISCV_SI_FT
 RISCV_BUILTIN (cv_alu_subuN,"cv_alu_subuN", RISCV_BUILTIN_DIRECT, 
RISCV_USI_FTYPE_USI_USI_UQI,  cvalu),
 RISCV_BUILTIN (cv_alu_subRN,"cv_alu_subRN", RISCV_BUILTIN_DIRECT, 
RISCV_SI_FTYPE_SI_SI_UQI, cvalu),
 RISCV_BUILTIN (cv_alu_subuRN,   "cv_alu_subuRN",RISCV_BUILTIN_DIRECT, 
RISCV_USI_FTYPE_USI_USI_UQI,  cvalu),
+
+// XCVELW
+RISCV_BUILTIN (cv_elw_elw_si, "cv_elw_elw", RISCV_BUILTIN_DIRECT, 
RISCV_USI_FTYPE_VOID_PTR, cvelw),
diff --git a/gcc/config/riscv/corev.md b/gcc/config/riscv/corev.md
index 1350bd4b81e..c7a2ba07bcc 100644
--- a/gcc/config/riscv/corev.md
+++ b/gcc/config/riscv/corev.md
@@ -24,6 +24,9 @@
   UNSPEC_CV_ALU_CLIPR
   UNSPEC_CV_ALU_CLIPU
   UNSPEC_CV_ALU_CLIPUR
+
+  ;;CORE-V EVENT LOAD
+  UNSPECV_CV_ELW
 ])
 
 ;; XCVMAC extension.
@@ -691,3 +694,15 @@
   cv.suburnr\t%0,%2,%3"
   [(set_attr "type" "arith")
   (set_attr "mode" "SI")])
+
+;; XCVELW builtins
+(define_insn "riscv_cv_elw_elw_si"
+  [(set (match_operand:SI 0 "register_operand" "=r")
+   (unspec_volatile [(match_operand:SI 1 "move_operand" "p")]
+ UNSPECV_CV_ELW))]
+
+  "TARGET_XCVELW && !TARGET_64BIT"
+  "cv.elw\t%0,%a1"
+
+  [(set_attr "type" "load")
+  (set_attr "mode" "SI")])
diff --git a/gcc/config/riscv/riscv-builtins.cc 
b/gcc/config/riscv/riscv-builtins.cc
index fc3976f3ba1..5ee11ebe3bc 100644
--- a/gcc/config/riscv/riscv-builtins.cc
+++ b/gcc/config/riscv/riscv-builtins.cc
@@ -128,6 +128,7 @@ AVAIL (hint_pause, (!0))
 // CORE-V AVAIL
 AVAIL (cvmac, TARGET_XCVMAC && !TARGET_64BIT)
 AVAIL (cvalu, TARGET_XCVALU && !TARGET_64BIT)
+AVAIL (cvelw, TARGET_XCVELW && !TARGET_64BIT)
 
 /* Construct a riscv_builtin_description from the given arguments.
 
@@ -168,6 +169,7 @@ AVAIL (cvalu, TARGET_XCVALU && !TARGET_64BIT)
 #define RISCV_ATYPE_HI intHI_type_node
 #define RISCV_ATYPE_SI intSI_type_node
 #define RISCV_ATYPE_VOID_PTR ptr_type_node
+#define RISCV_ATYPE_INT_PTR integer_ptr_type_node
 
 /* RISCV_FTYPE_ATYPESN takes N RISCV_FTYPES-like type codes and lists
their associated RISCV_ATYPEs.  */
diff --git a/gcc/config/riscv/riscv-ftypes.def 
b/gcc/config/riscv/riscv-ftypes.def
index 0d1e4dd061e..3e7d5c69503 100644
--- 

[PATCH] c++: Fix warmth propagation for member function templates

2023-12-12 Thread Jason Xu
Support was recently added for class-level warmth attributes that are
propagated to member functions. The current implementation ignores
member function templates and this patch fixes that.

gcc/cp/ChangeLog:

* class.cc (propagate_class_warmth_attribute): fix warmth
  propagation for member function templates
---
 gcc/cp/class.cc | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/gcc/cp/class.cc b/gcc/cp/class.cc
index 6fdb56abfb9..68e0f2e9e13 100644
--- a/gcc/cp/class.cc
+++ b/gcc/cp/class.cc
@@ -7805,8 +7805,13 @@ propagate_class_warmth_attribute (tree t)

   if (class_has_cold_attr || class_has_hot_attr)
 for (tree f = TYPE_FIELDS (t); f; f = DECL_CHAIN (f))
-  if (TREE_CODE (f) == FUNCTION_DECL)
-maybe_propagate_warmth_attributes (f, t);
+  {
+tree real_f = f;
+if (TREE_CODE (f) == TEMPLATE_DECL)
+  real_f = DECL_TEMPLATE_RESULT (f);
+if (TREE_CODE (real_f) == FUNCTION_DECL)
+  maybe_propagate_warmth_attributes (real_f, t);
+  }
 }

 tree
--
2.40.0


Re: [PATCH] SRA: Force gimple operand in an additional corner case (PR 112822)

2023-12-12 Thread Richard Biener



> On 12.12.2023 at 19:51, Peter Bergner wrote:
> 
> On 12/12/23 12:45 PM, Peter Bergner wrote:
>> +/* PR target/112822 */
> 
> Oops, this should be:
> 
> /* PR tree-optimization/112822 */
> 
> It's fixed on my end.

Ok

Richard 

> Peter
> 
> 
> 
> 


Re: [PATCH] SRA: Force gimple operand in an additional corner case (PR 112822)

2023-12-12 Thread Peter Bergner
On 12/12/23 12:45 PM, Peter Bergner wrote:
> +/* PR target/112822 */

Oops, this should be:

/* PR tree-optimization/112822 */

It's fixed on my end.

Peter






Re: [PATCH] SRA: Force gimple operand in an additional corner case (PR 112822)

2023-12-12 Thread Peter Bergner
On 12/12/23 10:50 AM, Martin Jambor wrote:
> The testcase has reasonable size but it is specific to ppc64le and its
> altivec vectors.  My plan is to ask the bug reporter to massage it into
> a target specific testcase in bugzilla.  Alternatively I can try to
> craft a testcase from scratch but that will take time.

I rewrote the Altivec specific part of the testcase to use generic C code
and it still ICEs for me on ppc64le using an unpatched compiler.  Therefore,
I think we can just add the updated testcase to the generic g++ tests. 

I'll note I was wrong in the bugzilla comments, -O3 -mcpu=power10 is not
required to hit the ICE.  A simple -O2 on ppc64le is enough to hit the ICE.

Ok for trunk?

Peter


testsuite: Add testcase for already fixed PR [PR112822]

gcc/testsuite/
PR tree-optimization/112822
* g++.dg/pr112822.C: New test.

diff --git a/gcc/testsuite/g++.dg/pr112822.C b/gcc/testsuite/g++.dg/pr112822.C
new file mode 100644
index 000..3921d5c1bbe
--- /dev/null
+++ b/gcc/testsuite/g++.dg/pr112822.C
@@ -0,0 +1,369 @@
+/* PR target/112822 */
+/* { dg-options "-w -O2" } */
+
+/* Verify we do not ICE on the following noisy creduced test case.  */
+
+namespace b {
+typedef int c;
+template  struct d;
+template  struct d { using f = e; };
+template  struct aa;
+template  struct aa { using f = h; };
+template  using ab = typename d::f;
+template  using n = typename aa::f;
+template  class af {
+public:
+  typedef __complex__ ah;
+  template  af operator+=(e) {
+ah o;
+x = o;
+return *this;
+  }
+  ah x;
+};
+} // namespace b
+namespace {
+enum { p };
+enum { ac, ad };
+struct ae;
+struct al;
+struct ag;
+typedef b::c an;
+namespace ai {
+template  struct ak { typedef aj f; };
+template  using ar = typename ak::f;
+template  struct am {
+  enum { at };
+};
+template  struct ao {
+  enum { at };
+};
+template  struct ap;
+template  struct aq {
+  enum { at };
+};
+} // namespace ai
+template  struct ay;
+template  class as;
+template  class ba;
+template  class aw;
+template  class be;
+template  class az;
+namespace ai {
+template  struct bg;
+template ::bd>
+struct bk;
+template  struct bf;
+template  struct bm;
+template  struct bh;
+template ::bj>::at> struct bp {
+  typedef bi f;
+};
+template  struct br {
+  typedef typename bp::f>::f f;
+};
+template  struct bn;
+template  struct bn {
+  typedef aw f;
+};
+template  struct bx {
+  typedef typename bn::bs, aj ::bo>::f f;
+};
+template  struct bt { typedef b::n<0, aj, aj> f; };
+template ::f> struct cb {
+  enum { bw };
+  typedef b::n::f> f;
+};
+template ::bs> struct by {
+  typedef be f;
+};
+template  struct bz {
+  typedef typename by::f f;
+};
+template  struct ch;
+template  struct ch { typedef ci bd; };
+} // namespace ai
+template > struct cg;
+template  struct cg { typedef aj cn; };
+namespace ai {
+template  cj cp;
+template  void cl(bu *cr, cj cs) { ct(cr, cs); }
+typedef __attribute__((altivec(vector__))) double co;
+void ct(double *cr, co cs) { *(co *)cr = cs; }
+struct cq {
+  co q;
+};
+template <> struct bm> { typedef cq f; };
+template <> struct bh { typedef cq bj; };
+void ct(b::af *cr, cq cs) { ct((double *)cr, cs.q); }
+template  struct cx {
+  template  void cu(cw *a, cj) {
+cl(a, cp);
+  }
+};
+} // namespace ai
+template  class ba : public ay {
+public:
+  typedef ai::ap bu;
+  typedef b::n::bo, bu, b::n::at, bu, bu>> cv;
+  typedef ay db;
+  db::dc;
+  cv coeff(an dd, an col) const { return dc().coeff(dd, col); }
+};
+template  class cz : public ba::at> {
+public:
+  ai::ap b;
+  enum { da, dg, dh, bv, bq, di = dg, bo };
+};
+template  class be : public cz {
+public:
+  typedef typename ai::ap::bu bu;
+  typedef cz db;
+  db::dc;
+  template  cd +=(const be &);
+  template  az df(de);
+};
+template  struct ay {
+  cd () { return *static_cast(this); }
+  cd dc() const;
+};
+template  class dl;
+namespace ai {
+template  struct ap> {
+  typedef bb dj;
+  typedef bc r;
+  typedef ap s;
+  typedef ap t;
+  typedef typename cg::cn bu;
+  typedef typename ch::bd>::bd cf;
+  enum { bo };
+};
+} // namespace ai
+template 
+class az : public dl, ai::ap, ai::bg::bd>> {
+public:
+  typedef dk bb;
+  typedef Rhs_ bc;
+  typedef typename ai::bt::f LhsNested;
+  typedef typename ai::bt::f dn;
+  typedef ai::ar u;
+  typedef ai::ar RhsNestedCleaned;
+  u lhs();
+  RhsNestedCleaned rhs();
+};
+template 
+class dl : public ai::bz, al>::f {};
+namespace ai {
+template  struct v { typedef ag w; };
+template  struct evaluator_traits_base {
+  typedef typename v::cf>::w w;
+};
+template  struct ax : evaluator_traits_base {};
+template  struct y { static const bool at = false; };
+template  class plainobjectbase_evaluator_data {
+public:
+  plainobjectbase_evaluator_data(bu *ptr, an) : data(ptr) {}
+  an outerStride() { return z; }
+  bu *data;
+};
+template  struct evaluator {
+  typedef cd PlainObjectType;
+  typedef typename PlainObjectType::bu bu;
+  enum { IsVectorAtCompileTime };
+  enum { 

[PATCH pushed] LoongArch: testsuite: Remove XFAIL in vect-ftint-no-inexact.c

2023-12-12 Thread Xi Ruoyao
After r14-6455 this no longer fails.

gcc/testsuite/ChangeLog:

* gcc.target/loongarch/vect-ftint-no-inexact.c (xfail): Remove.
---

Tested on loongarch64-linux-gnu.  Pushed as obvious.

 gcc/testsuite/gcc.target/loongarch/vect-ftint-no-inexact.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/gcc/testsuite/gcc.target/loongarch/vect-ftint-no-inexact.c 
b/gcc/testsuite/gcc.target/loongarch/vect-ftint-no-inexact.c
index 83d268099ac..61918beef5c 100644
--- a/gcc/testsuite/gcc.target/loongarch/vect-ftint-no-inexact.c
+++ b/gcc/testsuite/gcc.target/loongarch/vect-ftint-no-inexact.c
@@ -39,6 +39,5 @@
 /* { dg-final { scan-assembler-not "\txvftintrne\.w\.s" } } */
 /* { dg-final { scan-assembler-not "\txvftintrne\.l\.d" } } */
 
-/* trunc: XFAIL due to PR 107723 */
-/* { dg-final { scan-assembler "bl\t%plt\\(trunc\\)" { xfail *-*-* } } } */
+/* { dg-final { scan-assembler "bl\t%plt\\(trunc\\)" } } */
 /* { dg-final { scan-assembler "bl\t%plt\\(truncf\\)" } } */
-- 
2.43.0



[PATCH] c++: unifying FUNCTION_DECLs [PR93740]

2023-12-12 Thread Patrick Palka
Bootstrapped and regtested on x86_64-pc-linux-gnu, does this look
OK for trunk?  I considered removing the is_overloaded_fn test now as
well, but it could in theory be hit (and not subsumed by the
type_unknown_p test) for e.g. OVERLOAD of a single FUNCTION_DECL.  I
wonder if that's something we'd see here?  If not, I can remove the
test.  It seems safe to remove as far as the testsuite is concerned.

-- >8 --

unify currently always returns success when unifying two FUNCTION_DECLs
(due to the is_overloaded_fn deferment within the default case), which
means for the below testcase unify incorrectly matches ::foo with
::bar, which leads to deduction failure for the index_of calls due to
a bogus base class ambiguity.

This patch makes us instead handle unification of FUNCTION_DECL like
other decls, i.e. according to their identity.

PR c++/93740

gcc/cp/ChangeLog:

* pt.cc (unify) : Handle it like FIELD_DECL
and TEMPLATE_DECL.

gcc/testsuite/ChangeLog:

* g++.dg/template/ptrmem34.C: New test.
---
 gcc/cp/pt.cc |  1 +
 gcc/testsuite/g++.dg/template/ptrmem34.C | 27 
 2 files changed, 28 insertions(+)
 create mode 100644 gcc/testsuite/g++.dg/template/ptrmem34.C

diff --git a/gcc/cp/pt.cc b/gcc/cp/pt.cc
index c2ddbff405b..a8966e223f1 100644
--- a/gcc/cp/pt.cc
+++ b/gcc/cp/pt.cc
@@ -24967,6 +24967,7 @@ unify (tree tparms, tree targs, tree parm, tree arg, 
int strict,
   gcc_unreachable ();
 
 case FIELD_DECL:
+case FUNCTION_DECL:
 case TEMPLATE_DECL:
   /* Matched cases are handled by the ARG == PARM test above.  */
   return unify_template_argument_mismatch (explain_p, parm, arg);
diff --git a/gcc/testsuite/g++.dg/template/ptrmem34.C 
b/gcc/testsuite/g++.dg/template/ptrmem34.C
new file mode 100644
index 000..c349ca55639
--- /dev/null
+++ b/gcc/testsuite/g++.dg/template/ptrmem34.C
@@ -0,0 +1,27 @@
+// PR c++/93740
+// { dg-do compile { target c++11 } }
+
+struct A {
+  void foo();
+  void bar();
+};
+
+template 
+struct const_val{};
+
+template 
+struct indexed_elem{};
+
+using mem_fun_A_foo = const_val;
+using mem_fun_A_bar = const_val;
+
+struct A_indexed_member_funcs
+  : indexed_elem<0, mem_fun_A_foo>,
+indexed_elem<1, mem_fun_A_bar>
+{};
+
+template 
+constexpr int index_of(indexed_elem) { return N; }
+
+static_assert(index_of(A_indexed_member_funcs{}) == 0, "");
+static_assert(index_of(A_indexed_member_funcs{}) == 1, "");
-- 
2.43.0.76.g1a87c842ec



Re: [PATCH v2] LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT.

2023-12-12 Thread Xi Ruoyao
On Tue, 2023-12-12 at 20:39 +0800, Xi Ruoyao wrote:
> On Tue, 2023-12-12 at 19:59 +0800, Jiahao Xu wrote:
> > > I guess here the problem is floating-point compare instruction is much
> > > more costly than other instructions but the fact is not correctly
> > > modeled yet.  Could you try
> > > https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640012.html
> > > where I've raised fp_add cost (which is used for estimating floating-
> > > point compare cost) to 5 instructions and see if it solves your problem
> > > without LOGICAL_OP_NON_SHORT_CIRCUIT?
> > I think this is not the same issue as the cost of floating-point 
> > comparison instructions. The definition of LOGICAL_OP_NON_SHORT_CIRCUIT 
> > affects how the short-circuit branch, such as (A AND-IF B), is executed, 
> > and it is not directly related to the cost of floating-point comparison 
> > instructions. I will try to test it using SPECCPU 2017.
> 
> The point is if the cost of floating-point comparison is very high, the
> middle end *should* short cut floating-point comparisons even if
> LOGICAL_OP_NON_SHORT_CIRCUIT = 1.
> 
> I've created https://gcc.gnu.org/PR112985.
> 
> Another factor regressing the code is we don't have modeled movcf2gr
> instruction yet, so we are not really eliding the branches as
> LOGICAL_OP_NON_SHORT_CIRCUIT = 1 supposes to do.

I made up this:

diff --git a/gcc/config/loongarch/loongarch.md 
b/gcc/config/loongarch/loongarch.md
index a5d0dcd65fe..84d828ebd0f 100644
--- a/gcc/config/loongarch/loongarch.md
+++ b/gcc/config/loongarch/loongarch.md
@@ -3169,6 +3169,42 @@ (define_insn "s__using_FCCmode"
   [(set_attr "type" "fcmp")
(set_attr "mode" "FCC")])
 
+(define_insn "movcf2gr"
+  [(set (match_operand:GPR 0 "register_operand" "=r")
+   (if_then_else:GPR (ne (match_operand:FCC 1 "register_operand" "z")
+ (const_int 0))
+ (const_int 1)
+ (const_int 0)))]
+  "TARGET_HARD_FLOAT"
+  "movcf2gr\t%0,%1"
+  [(set_attr "type" "move")
+   (set_attr "mode" "FCC")])
+
+(define_expand "cstore4"
+  [(set (match_operand:SI 0 "register_operand")
+   (match_operator:SI 1 "loongarch_fcmp_operator"
+ [(match_operand:ANYF 2 "register_operand")
+  (match_operand:ANYF 3 "register_operand")]))]
+  ""
+  {
+rtx fcc = gen_reg_rtx (FCCmode);
+rtx cmp = gen_rtx_fmt_ee (GET_CODE (operands[1]), FCCmode,
+ operands[2], operands[3]);
+
+emit_insn (gen_rtx_SET (fcc, cmp));
+if (TARGET_64BIT)
+  {
+   rtx gpr = gen_reg_rtx (DImode);
+   emit_insn (gen_movcf2grdi (gpr, fcc));
+   emit_insn (gen_rtx_SET (operands[0],
+   lowpart_subreg (SImode, gpr, DImode)));
+  }
+else
+  emit_insn (gen_movcf2grsi (operands[0], fcc));
+
+DONE;
+  })
+
 

 ;;
 ;;  
diff --git a/gcc/config/loongarch/predicates.md 
b/gcc/config/loongarch/predicates.md
index 9e9ce58cb53..83fea08315c 100644
--- a/gcc/config/loongarch/predicates.md
+++ b/gcc/config/loongarch/predicates.md
@@ -590,6 +590,10 @@ (define_predicate "order_operator"
 (define_predicate "loongarch_cstore_operator"
   (match_code "ne,eq,gt,gtu,ge,geu,lt,ltu,le,leu"))
 
+(define_predicate "loongarch_fcmp_operator"
+  (match_code
+"unordered,uneq,unlt,unle,eq,lt,le,ordered,ltgt,ne,ge,gt,unge,ungt"))
+
 (define_predicate "small_data_pattern"
   (and (match_code "set,parallel,unspec,unspec_volatile,prefetch")
(match_test "loongarch_small_data_pattern_p (op)")))

and now this function is compiled to (with LOGICAL_OP_NON_SHORT_CIRCUIT
= 1):

fld.s   $f1,$r4,0
fld.s   $f0,$r4,4
fld.s   $f3,$r4,8
fld.s   $f2,$r4,12
fcmp.slt.s  $fcc1,$f0,$f3
fcmp.sgt.s  $fcc0,$f1,$f2
movcf2gr$r13,$fcc1
movcf2gr$r12,$fcc0
or  $r12,$r12,$r13
bnez$r12,.L3
fld.s   $f4,$r4,16
fld.s   $f5,$r4,20
or  $r4,$r0,$r0
fcmp.sgt.s  $fcc1,$f1,$f5
fcmp.slt.s  $fcc0,$f0,$f4
movcf2gr$r12,$fcc1
movcf2gr$r13,$fcc0
or  $r12,$r12,$r13
bnez$r12,.L2
fcmp.sgt.s  $fcc1,$f3,$f5
fcmp.slt.s  $fcc0,$f2,$f4
movcf2gr$r4,$fcc1
movcf2gr$r12,$fcc0
or  $r4,$r4,$r12
xori$r4,$r4,1
slli.w  $r4,$r4,0
jr  $r1
.align  4
.L3:
or  $r4,$r0,$r0
.align  4
.L2:
jr  $r1

Per my micro-benchmark this is much faster than
LOGICAL_OP_NON_SHORT_CIRCUIT = 0 for randomly generated inputs (i.e.
when the branches are not predictable).

Note that there is a redundant slli.w instruction in the compiled code
and I couldn't find a way to remove it (my trick in the TARGET_64BIT
branch only works for simple examples).  We may be able to handle this via
the ext_dce pass [1] in the future.


Re: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2023-12-12 Thread Richard Earnshaw




On 30/11/2023 12:55, Stamatis Markianos-Wright wrote:

Hi Andre,

Thanks for the comments, see latest revision attached.

On 27/11/2023 12:47, Andre Vieira (lists) wrote:

Hi Stam,

Just some comments.

+/* Recursively scan through the DF chain backwards within the basic block and
+   determine if any of the USEs of the original insn (or the USEs of the insns
+   where thy were DEF-ed, etc., recursively) were affected by implicit VPT

s/Recursively scan/Scan/ as you no longer recurse, thanks for that by the way :)
Also remove "recursively" for the same reasons.

+  if (!CONST_INT_P (cond_counter_iv.step) || !CONST_INT_P 
(cond_temp_iv.step))

+    return NULL;
+  /* Look at the steps and swap around the rtx's if needed. Error 
out if

+ one of them cannot be identified as constant.  */
+  if (INTVAL (cond_counter_iv.step) != 0 && INTVAL 
(cond_temp_iv.step) != 0)

+    return NULL;

Move the comment above the if before, as the erroring out it talks 
about is there.

Done


+  emit_note_after ((enum insn_note)NOTE_KIND (insn), BB_END (body));
 space after 'insn_note)'

@@ -173,14 +176,14 @@ doloop_condition_get (rtx_insn *doloop_pat)
   if (! REG_P (reg))
 return 0;
 -  /* Check if something = (plus (reg) (const_int -1)).
+  /* Check if something = (plus (reg) (const_int -n)).
  On IA-64, this decrement is wrapped in an if_then_else.  */
   inc_src = SET_SRC (inc);
   if (GET_CODE (inc_src) == IF_THEN_ELSE)
 inc_src = XEXP (inc_src, 1);
   if (GET_CODE (inc_src) != PLUS
   || XEXP (inc_src, 0) != reg
-  || XEXP (inc_src, 1) != constm1_rtx)
+  || !CONST_INT_P (XEXP (inc_src, 1)))

Do we ever check that inc_src is negative? We used to check if it was 
-1, now we only check it's a constnat, but not a negative one, so I 
suspect this needs a:

|| INTVAL (XEXP (inc_src, 1)) >= 0

Good point. Done


@@ -492,7 +519,8 @@ doloop_modify (class loop *loop, class niter_desc 
*desc,

 case GE:
   /* Currently only GE tests against zero are supported.  */
   gcc_assert (XEXP (condition, 1) == const0_rtx);
-
+  /* FALLTHRU */
+    case GTU:
   noloop = constm1_rtx;

I spent a very long time staring at this trying to understand why 
noloop = constm1_rtx for GTU, where I thought it should've been (count 
& (n-1)). For the current use of doloop it doesn't matter because ARM 
is the only target using it and you set desc->noloop_assumptions to 
null_rtx in 'arm_attempt_dlstp_transform' so noloop is never used. 
However, if a different target accepts this GTU pattern then this 
target agnostic code will do the wrong thing.  I suggest we either:
 - set noloop to what we think might be the correct value, which if 
you ask me should be 'count & (XEXP (condition, 1))',
 - or add a gcc_assert (GET_CODE (condition) != GTU); under the if 
(desc->noloop_assumption); part and document why.  I have a slight 
preference for the assert given otherwise we are adding code that we 
can't test.


Yea, that's true tbh. I've done the latter, but also separated out the 
"case GTU:" and added a comment, so that it's more clear that the noloop 
things aren't used in the only implemented GTU case (Arm)


Thank you :)



LGTM otherwise (but I don't have the power to approve this ;)).

Kind regards,
Andre

From: Stamatis Markianos-Wright 
Sent: Thursday, November 16, 2023 11:36 AM
To: Stamatis Markianos-Wright via Gcc-patches; Richard Earnshaw; 
Richard Sandiford; Kyrylo Tkachov
Subject: [PING][PATCH 2/2] arm: Add support for MVE Tail-Predicated 
Low Overhead Loops


Pinging back to the top of reviewers' inboxes due to worry about Stage 1
End in a few days :)


See the last email for the latest version of the 2/2 patch. The 1/2
patch is A-Ok from Kyrill's earlier target-backend review.


On 10/11/2023 12:41, Stamatis Markianos-Wright wrote:


On 06/11/2023 17:29, Stamatis Markianos-Wright wrote:


On 06/11/2023 11:24, Richard Sandiford wrote:

Stamatis Markianos-Wright  writes:
One of the main reasons for reading the arm bits was to try to 
answer

the question: if we switch to a downcounting loop with a GE
condition,
how do we make sure that the start value is not a large unsigned
number that is interpreted as negative by GE?  E.g. if the loop
originally counted up in steps of N and used an LTU condition,
it could stop at a value in the range [INT_MAX + 1, UINT_MAX].
But the loop might never iterate if we start counting down from
most values in that range.

Does the patch handle that?

So AFAICT this is actually handled in the generic code in
`doloop_valid_p`:

This kind of loop fails because it is "desc->infinite", so no
loop-doloop conversion is attempted at all (even for standard
dls/le loops).

Thanks to that check I haven't been able to trigger anything like the
behaviour you describe, do you think the doloop_valid_p checks are
robust enough?

The loops I was thinking of are provably not infinite though. E.g.:

   for (unsigned int i 

Re: [PATCH] c++: End lifetime of objects in constexpr after destructor call [PR71093]

2023-12-12 Thread Jason Merrill

On 12/12/23 10:24, Jason Merrill wrote:

On 12/12/23 06:15, Jakub Jelinek wrote:

On Tue, Dec 12, 2023 at 02:13:43PM +0300, Alexander Monakov wrote:



On Tue, 12 Dec 2023, Jakub Jelinek wrote:


On Mon, Dec 11, 2023 at 05:00:50PM -0500, Jason Merrill wrote:
In discussion of PR71093 it came up that more clobber_kind options 
would be

useful within the C++ front-end.

gcc/ChangeLog:

* tree-core.h (enum clobber_kind): Rename CLOBBER_EOL to
CLOBBER_STORAGE_END.  Add CLOBBER_STORAGE_BEGIN,
CLOBBER_OBJECT_BEGIN, CLOBBER_OBJECT_END.
* gimple-lower-bitint.cc
* gimple-ssa-warn-access.cc
* gimplify.cc
* tree-inline.cc
* tree-ssa-ccp.cc: Adjust for rename.


Doesn't build_clobber_this in the C++ front-end need to be adjusted too?
I think it is used to place clobbers at start of the ctor (should be
CLOBBER_OBJECT_BEGIN in the new nomenclature) and end of the dtor (i.e.
CLOBBER_OBJECT_END).


You're right.


I had been thinking to leave that to Nathaniel's patch, but sure, I'll 
hoist those bits out:


I've now pushed this version of the patch; Nathaniel, do you want to 
rebase on it?


Jason



[pushed] testsuite: fix is_nothrow_default_constructible8.C

2023-12-12 Thread Jason Merrill
Tested x86_64-pc-linux-gnu, applying to trunk.

-- 8< --

This testcase uses variable templates, a C++14 feature.

gcc/testsuite/ChangeLog:

* g++.dg/ext/is_nothrow_constructible8.C: Require C++14.
---
 gcc/testsuite/g++.dg/ext/is_nothrow_constructible8.C | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/g++.dg/ext/is_nothrow_constructible8.C 
b/gcc/testsuite/g++.dg/ext/is_nothrow_constructible8.C
index c2a0b93ae97..996f6d895ff 100644
--- a/gcc/testsuite/g++.dg/ext/is_nothrow_constructible8.C
+++ b/gcc/testsuite/g++.dg/ext/is_nothrow_constructible8.C
@@ -1,4 +1,4 @@
-// { dg-do compile { target c++11 } }
+// { dg-do compile { target c++14 } }
 // PR c++/96090
 
 template 

base-commit: 321477fc3a0f8de18c4452f431309f896ae3a854
-- 
2.39.3



Re: [V2] New pass for sign/zero extension elimination -- not ready for "final" review

2023-12-12 Thread Jeff Law




On 11/29/23 21:10, Joern Rennecke wrote:

  I originally computed mmask in carry_backpropagate from XEXP (x, 0),
but abandoned that when I realized we also get called for RTX_OBJ
things.  I forgot to adjust the SIGN_EXTEND code, though.  Fixed
in the attached revised patch.  Also made sure to not make inputs
of LSHIFTRT / ASHIFTRT live if the output is dead (and commened
the checks for (mask == 0) in the process).

Something that could be done to futher simplif the code is to make
carry_backpropagate do all the rtx_code-dependent propagation
decisions.  I.e. would have cases for RTX_OBJ, AND, OR, IOR etc
that propagate the mask, and the default action would be to make
the input live (after the check not make any bits in the input
live if the output is dead).

Then we wouldn't need safe_for_live_propagation any more.

Not sure if carry_backpropagate would then still be a suitable name
anymore, though.


tmp.txt

 * ext-dce.cc (carry_backpropagate): Always return 0 when output is dead.  
Fix SIGN_EXTEND input mask.

 * ext-dce.cc: handle vector modes.
 
 * ext-dce.cc: Amend comment to explain how liveness of vectors is tracked.

   (carry_backpropagate): Use GET_MODE_INNER.
   (ext_dce_process_sets): Likewise.  Only apply big endian correction for
   subregs if they don't have a vector mode.
   (ext_cde_process_uses): Likewise.

 * ext-dce.cc: carry_backpropagate: [US]S_ASHIFT fix, handle [LA]SHIFTRT
 
 * ext-dce.cc (safe_for_live_propagation): Add LSHIFTRT and ASHIFTRT.

   (carry_backpropagate): Reformat top comment.
   Add handling of LSHIFTRT and ASHIFTRT.
   Fix bit count for [SU]MUL_HIGHPART.
   Fix pasto for [SU]S_ASHIFT.

 * ext-dce.c: Fixes for carry handling.
 
 * ext-dce.c (safe_for_live_propagation): Handle MINUS.

   (ext_dce_process_uses): Break out carry handling into ..
   (carry_backpropagate): This new function.
   Better handling of ASHIFT.
   Add handling of SMUL_HIGHPART, UMUL_HIGHPART, SIGN_EXTEND, SS_ASHIFT and
   US_ASHIFT.

 * ext-dce.c: fix SUBREG_BYTE test
I haven't done an update in a little while.  My tester spun this without 
the vector bits, which I'm still pondering.  It did flag one issue.


Specifically, on the alpha, pr53645.c failed due to the ASHIFTRT handling.



+case ASHIFTRT:
+  if (CONSTANT_P (XEXP (x, 1))
+ && known_lt (UINTVAL (XEXP (x, 1)), GET_MODE_BITSIZE (mode)))
+   {
+ HOST_WIDE_INT sign = 0;
+ if (HOST_BITS_PER_WIDE_INT - clz_hwi (mask) + INTVAL (XEXP (x, 1))
+ > GET_MODE_BITSIZE (mode).to_constant ())
+   sign = (1ULL << GET_MODE_BITSIZE (mode).to_constant ()) - 1;
+ return sign | (mmask & (mask << INTVAL (XEXP (x, 1;
+   }

The "-1" when computing the sign bit is meant to apply to the shift count.

Jeff


Re: [PATCH] SRA: Force gimple operand in an additional corner case (PR 112822)

2023-12-12 Thread Richard Biener



> On 12.12.2023 at 17:50, Martin Jambor wrote:
> 
> Hi,
> 
> PR 112822 revealed a corner case in load_assign_lhs_subreplacements
> where it creates invalid gimple: an assignment where on the LHS there
> is a complex variable which however is not a gimple register because
> it has partial defs and on the right hand side there is a
> VIEW_CONVERT_EXPR.  This patch invokes force_gimple_operand_gsi on
> such statements (like it already does when both sides of a generated
> assignment have partial definitions).
> 
> I've made sure the patch passes bootstrap and testsuite on x86_64-linux,
> the bug reporter was kind enough to also check the same on an
> powerpc64le-linux (see bugzilla comment #8).
> 
> The testcase has reasonable size but it is specific to ppc64le and its
> altivec vectors.  My plan is to ask the bug reporter to massage it into
> a target specific testcase in bugzilla.  Alternatively I can try to
> craft a testcase from scratch but that will take time.
> 
> Despite the above, is the patch OK for master?

Ok

Richard 

> 
> Thanks,
> 
> Martin
> 
> 
> 
> gcc/ChangeLog:
> 
> 2023-12-12  Martin Jambor  
> 
>PR tree-optimization/112822
>* tree-sra.cc (load_assign_lhs_subreplacements): Invoke
>force_gimple_operand_gsi also when LHS has partial stores and RHS is a
>VIEW_CONVERT_EXPR.
> ---
> gcc/tree-sra.cc | 10 +++---
> 1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/gcc/tree-sra.cc b/gcc/tree-sra.cc
> index 3bd0c7a9af0..99a1b0a6d17 100644
> --- a/gcc/tree-sra.cc
> +++ b/gcc/tree-sra.cc
> @@ -4219,11 +4219,15 @@ load_assign_lhs_subreplacements (struct access *lacc,
>  if (racc && racc->grp_to_be_replaced)
>{
>  rhs = get_access_replacement (racc);
> +  bool vce = false;
>  if (!useless_type_conversion_p (lacc->type, racc->type))
> -rhs = fold_build1_loc (sad->loc, VIEW_CONVERT_EXPR,
> -   lacc->type, rhs);
> +{
> +  rhs = fold_build1_loc (sad->loc, VIEW_CONVERT_EXPR,
> + lacc->type, rhs);
> +  vce = true;
> +}
> 
> -  if (racc->grp_partial_lhs && lacc->grp_partial_lhs)
> +  if (lacc->grp_partial_lhs && (vce || racc->grp_partial_lhs))
>rhs = force_gimple_operand_gsi (&sad->old_gsi, rhs, true,
>NULL_TREE, true, GSI_SAME_STMT);
>}
> --
> 2.43.0
> 


Re: Disable FMADD in chains for Zen4 and generic

2023-12-12 Thread Alexander Monakov

On Tue, 12 Dec 2023, Richard Biener wrote:

> On Tue, Dec 12, 2023 at 3:38 PM Jan Hubicka  wrote:
> >
> > Hi,
> > this patch disables use of FMA in matrix multiplication loop for generic 
> > (for
> > x86-64-v3) and zen4.  I tested this on zen4 and Xenon Gold Gold 6212U.
> >
> > For Intel this is neutral both on the matrix multiplication microbenchmark
> > (attached) and spec2k17 where the difference was within noise for Core.
> >
> > On core the micro-benchmark runs as follows:
> >
> > With FMA:
> >
> >578,500,241  cycles:u #3.645 GHz 
> > ( +-  0.12% )
> >753,318,477  instructions:u   #1.30  insn 
> > per cycle  ( +-  0.00% )
> >125,417,701  branches:u   #  790.227 M/sec   
> > ( +-  0.00% )
> >   0.159146 +- 0.000363 seconds time elapsed  ( +-  0.23% )
> >
> >
> > No FMA:
> >
> >577,573,960  cycles:u #3.514 GHz 
> > ( +-  0.15% )
> >878,318,479  instructions:u   #1.52  insn 
> > per cycle  ( +-  0.00% )
> >125,417,702  branches:u   #  763.035 M/sec   
> > ( +-  0.00% )
> >   0.164734 +- 0.000321 seconds time elapsed  ( +-  0.19% )
> >
> > So the cycle count is unchanged and discrete multiply+add takes same time 
> > as FMA.
> >
> > While on zen:
> >
> >
> > With FMA:
> >  484875179  cycles:u #3.599 GHz 
> >  ( +-  0.05% )  (82.11%)
> >  752031517  instructions:u   #1.55  insn 
> > per cycle
> >  125106525  branches:u   #  928.712 M/sec   
> >  ( +-  0.03% )  (85.09%)
> > 128356  branch-misses:u  #0.10% of all 
> > branches  ( +-  0.06% )  (83.58%)
> >
> > No FMA:
> >  375875209  cycles:u #3.592 GHz 
> >  ( +-  0.08% )  (80.74%)
> >  875725341  instructions:u   #2.33  insn 
> > per cycle
> >  124903825  branches:u   #1.194 G/sec   
> >  ( +-  0.04% )  (84.59%)
> >   0.105203 +- 0.000188 seconds time elapsed  ( +-  0.18% )
> >
> > The diffrerence is that Cores understand the fact that fmadd does not need
> > all three parameters to start computation, while Zen cores doesn't.
> 
> This came up in a separate thread as well, but when doing reassoc of a
> chain with multiple dependent FMAs.

> I can't understand how this uarch detail can affect performance when as in
> the testcase the longest input latency is on the multiplication from a
> memory load.

The latency from the memory operand doesn't matter since it's not a part
of the critical path. The memory uop of the FMA starts executing as soon
as the address is ready.

> Do we actually understand _why_ the FMAs are slower here?

It's simple: on Zen4 FMA has latency 4 while add has latency 3, and you
clearly see it in the quoted numbers: zen-with-fma has slightly below 4
cycles per branch, zen-without-fma has exactly 3 cycles per branch.
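
Spelling the arithmetic out from the numbers quoted above:

  484875179 / 125106525 branches ~= 3.88 cycles per branch  (FMA, latency 4)
  375875209 / 124903825 branches ~= 3.01 cycles per branch  (mul+add, latency 3)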

Please refer to uops.info for latency data:
https://uops.info/html-instr/VMULPS_YMM_YMM_YMM.html
https://uops.info/html-instr/VFMADD231PS_YMM_YMM_YMM.html

> Do we know that Cores can start the multiplication part when the add
> operand isn't ready yet?  I'm curious how you set up a micro benchmark to
> measure this.

Unlike some of the Arm cores, none of x86 cores can consume the addend
of an FMA on a later cycle than the multiplicands, with Alder Lake-E
being the sole exception, apparently (see 6/10/10 latencies in the
aforementioned uops.info FMA page).

> There's one detail on Zen in that it can issue 2 FADDs and 2 FMUL/FMA per
> cycle.  So in theory we can at most do 2 FMA per cycle but with latency
> (FMA) == 4 for Zen3/4 and latency (FADD/FMUL) == 3 we might be able to
> squeeze out a little bit more throughput when there are many FADD/FMUL ops
> to execute?  That works independent on whether FMAs have a head-start on
> multiplication as you'd still be bottle-necked on the 2-wide issue for
> FMA?

It doesn't matter here since all FMAs/FMULs are dependent on each other
so the processor can start a new FMA only each 4th (or 3rd cycle), except
when starting a new iteration of the outer loop.

> On Icelake it seems all FADD/FMUL/FMA share ports 0 and 1 and all have a
> latency of four.  So you should get worse results there (looking at the
> numbers above you do get worse results, slightly so), probably the higher
> number of uops is hidden by the latency.

A simple solution would be to enable AVOID_FMA_CHAINS when FMA latency 
exceeds FMUL latency (all Zens and Broadwell).

> > Since this seems noticeable win on zen and not loss on Core it seems like 
> > 

[PATCH] SRA: Force gimple operand in an additional corner case (PR 112822)

2023-12-12 Thread Martin Jambor
Hi,

PR 112822 revealed a corner case in load_assign_lhs_subreplacements
where it creates invalid gimple: an assignment where on the LHS there
is a complex variable which however is not a gimple register because
it has partial defs and on the right hand side there is a
VIEW_CONVERT_EXPR.  This patch invokes force_gimple_operand_gsi on
such statements (like it already does when both sides of a generated
assignment have partial definitions).
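
Roughly, the invalid statement was of the form (schematic, not an exact
dump from the PR):

  D.1234 = VIEW_CONVERT_EXPR<_Complex float>(repl$vector);

where D.1234 has partial defs and is therefore not a gimple register.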

I've made sure the patch passes bootstrap and testsuite on x86_64-linux,
the bug reporter was kind enough to also check the same on an
powerpc64le-linux (see bugzilla comment #8).

The testcase has reasonable size but it is specific to ppc64le and its
altivec vectors.  My plan is to ask the bug reporter to massage it into
a target specific testcase in bugzilla.  Alternatively I can try to
craft a testcase from scratch but that will take time.

Despite the above, is the patch OK for master?

Thanks,

Martin



gcc/ChangeLog:

2023-12-12  Martin Jambor  

PR tree-optimization/112822
* tree-sra.cc (load_assign_lhs_subreplacements): Invoke
force_gimple_operand_gsi also when LHS has partial stores and RHS is a
VIEW_CONVERT_EXPR.
---
 gcc/tree-sra.cc | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/gcc/tree-sra.cc b/gcc/tree-sra.cc
index 3bd0c7a9af0..99a1b0a6d17 100644
--- a/gcc/tree-sra.cc
+++ b/gcc/tree-sra.cc
@@ -4219,11 +4219,15 @@ load_assign_lhs_subreplacements (struct access *lacc,
  if (racc && racc->grp_to_be_replaced)
{
  rhs = get_access_replacement (racc);
+ bool vce = false;
  if (!useless_type_conversion_p (lacc->type, racc->type))
-   rhs = fold_build1_loc (sad->loc, VIEW_CONVERT_EXPR,
-  lacc->type, rhs);
+   {
+ rhs = fold_build1_loc (sad->loc, VIEW_CONVERT_EXPR,
+lacc->type, rhs);
+ vce = true;
+   }
 
- if (racc->grp_partial_lhs && lacc->grp_partial_lhs)
+ if (lacc->grp_partial_lhs && (vce || racc->grp_partial_lhs))
	rhs = force_gimple_operand_gsi (&sad->old_gsi, rhs, true,
NULL_TREE, true, GSI_SAME_STMT);
}
-- 
2.43.0



Re: Disable FMADD in chains for Zen4 and generic

2023-12-12 Thread Jan Hubicka
> 
> This came up in a separate thread as well, but when doing reassoc of a
> chain with
> multiple dependent FMAs.
> 
> I can't understand how this uarch detail can affect performance when
> as in the testcase
> the longest input latency is on the multiplication from a memory load.
> Do we actually
> understand _why_ the FMAs are slower here?

This is my understanding:
The loop is well predictable and memory calculations + loads can happen
in parallel.  So the main dependency chain is updating the accumulator
computing c[i][j].  FMADD is 4 cycles on Zen4, while ADD is 3.  So the
loop with FMADD cannot run any faster than one iteration per 4 cycles,
while ADD can do one iteration per 3.  Which roughly matches the
speedup we see: 484875179*3/4=363656384, while the measured speed is
375875209 cycles.  The benchmark is quite short and I run it 100 times
in perf to collect the data, so the overhead probably contributes to the
smaller than expected difference.
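
For reference, the kernel in question is the usual inner loop (a sketch
using the names from the description above):

  for (int k = 0; k < n; k++)
    c[i][j] += a[i][k] * b[k][j];  /* fmadd: 4 cycles/iter, mul+add: 3 */

The loads of a and b are off the critical path; only the add into
c[i][j] is loop-carried.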

> 
> Do we know that Cores can start the multiplication part when the add
> operand isn't
> ready yet?  I'm curious how you set up a micro benchmark to measure this.

Here is a cycle-counting benchmark:
#include <stdio.h>
int
main()
{ 
  float o=0;
  for (int i = 0; i < 1000000000; i++)
  {
#ifdef ACCUMULATE
float p1 = o;
float p2 = 0;
#else
float p1 = 0;
float p2 = o;
#endif
float p3 = 0;
#ifdef FMA
asm ("vfmadd231ss %2, %3, %0":"=x"(o):"0"(p1),"x"(p2),"x"(p3));
#else
float t;
asm ("mulss %2, %0":"=x"(t):"0"(p2),"x"(p3));
asm ("addss %2, %0":"=x"(o):"0"(p1),"x"(t));
#endif
  }
  printf ("%f\n",o);
  return 0;
}

It performs FMAs in sequence all with zeros.  If you define ACCUMULATE
you get the pattern from matrix multiplication. On Zen I get:

jh@ryzen3:~> gcc -O3 -DFMA -DACCUMULATE l.c ; perf stat ./a.out 2>&1 | grep 
cycles:
 4,001,011,489  cycles:u #4.837 GHz 
(83.32%)
jh@ryzen3:~> gcc -O3 -DACCUMULATE l.c ; perf stat ./a.out 2>&1 | grep cycles:
 3,000,335,064  cycles:u #4.835 GHz 
(83.08%)

So 4 cycles for FMA loop and 3 cycles for separate mul and add.
Muls execute in parallel to adds in the second case.
If the dependence chain is done over multiplied paramter I get:

jh@ryzen3:~> gcc -O3 -DFMA l.c ; perf stat ./a.out 2>&1 | grep cycles:
 4,000,118,069  cycles:u #4.836 GHz 
(83.32%)
jh@ryzen3:~> gcc -O3  l.c ; perf stat ./a.out 2>&1 | grep cycles:
 6,001,947,341  cycles:u #4.838 GHz 
(83.32%)

FMA is the same (it is still one FMA instruction per iteration) while
mul+add is 6 cycles since the dependency chain is longer.

Core gives me:

jh@aster:~> gcc -O3 l.c -DFMA -DACCUMULATE ; perf stat ./a.out 2>&1 | grep 
cycles:u
 5,001,515,473  cycles:u #3.796 GHz
jh@aster:~> gcc -O3 l.c  -DACCUMULATE ; perf stat ./a.out 2>&1 | grep cycles:u
 4,000,977,739  cycles:u #3.819 GHz
jh@aster:~> gcc -O3 l.c  -DFMA ; perf stat ./a.out 2>&1 | grep cycles:u
 5,350,523,047  cycles:u #3.814 GHz
jh@aster:~> gcc -O3 l.c   ; perf stat ./a.out 2>&1 | grep cycles:u
10,251,994,240  cycles:u #3.852 GHz

So FMA seems to be 5 cycles if we accumulate and a bit more (modulo
noise) if we do the long chain.  I think some cores have a bigger
difference between these two numbers.
I am a bit surprised by the last number of 10 cycles.  I would expect 8.

I changed the matrix multiplication benchmark to repeat multiplication
100 times.

> 
> There's one detail on Zen in that it can issue 2 FADDs and 2 FMUL/FMA per 
> cycle.
> So in theory we can at most do 2 FMA per cycle but with latency (FMA)
> == 4 for Zen3/4
> and latency (FADD/FMUL) == 3 we might be able to squeeze out a little bit more
> throughput when there are many FADD/FMUL ops to execute?  That works 
> independent
> on whether FMAs have a head-start on multiplication as you'd still be
> bottle-necked
> on the 2-wide issue for FMA?

I am not sure I follow what you say here.  The knob only checks for
FMADDs used in an accumulation-type loop, so it is latency 4 versus
latency 3 per accumulation.  Indeed in other loops fmadd is a win.
> 
> On Icelake it seems all FADD/FMUL/FMA share ports 0 and 1 and all have a 
> latency
> of four.  So you should get worse results there (looking at the
> numbers above you
> do get worse results, slightly so), probably the higher number of uops is 
> hidden
> by the latency.
I think the slower non-FMA on Core was just noise (it shows in overall
time but not in cycle counts).

I changed the benchmark to run the multiplication 100 times.
On Intel I get:

jh@aster:~/gcc/build/gcc> gcc matrix-nofma.s ; perf stat ./a.out
mult took   15146405 clocks

 Performance counter stats for './a.out':

 15,149.62 msec 

Re: [PATCH] Treat "p" in asms as addressing VOIDmode

2023-12-12 Thread Maciej W. Rozycki
On Mon, 11 Dec 2023, Richard Sandiford wrote:

> > It all seems a bit hackish.  I don't think ports have had much success 
> > using 'p' through the decades.  I think I generally ended up having to 
> > go with distinct constraints rather than relying on 'p'.
> >
> > OK for the trunk, but ewww.
> 
> Thanks, pushed.  And yeah, eww is fair.  I'd be happy for this to become
> an unconditional VOIDmode once reload is removed.

 Hmm, LRA seems unable currently to work with indexed address modes, such 
as with these address load machine instructions:

movaq   0x12345678[%r1],%r2
movaq   (%r0)[%r1],%r2
movaq   0x12345678(%r0)[%r1],%r2
movaq   *0x12345678[%r1],%r2
movaq   *(%r0)[%r1],%r2
movaq   *0x12345678(%r0)[%r1],%r2

(where R1 is scaled according to the width of data the address refers to 
before adding to the direct or indirect address component worked out from 
base+displacement, by 8 in this example, suitably for DImode or DFmode) so 
who knows what we'll end up with once the VAX port has been converted.

  Maciej


Re: [PATCH v3 2/6] libgomp, openmp: Add ompx_pinned_mem_alloc

2023-12-12 Thread Andrew Stubbs

On 12/12/2023 10:05, Tobias Burnus wrote:

Hi Andrew,

On 11.12.23 18:04, Andrew Stubbs wrote:

This creates a new predefined allocator as a shortcut for using pinned
memory with OpenMP.  The name uses the OpenMP extension space and is
intended to be consistent with other OpenMP implementations currently in
development.


Discussed this with Jakub - and using 9 would not permit having a
contiguous range of numbers if OpenMP ever extends this.

Thus, maybe start the ompx_ ones at 100.


These numbers are not defined in any standard, are they? We can use 
whatever enumeration we choose.


I'm happy to change them, but the *_mem_alloc numbers are used as an 
index into a constant table to map them to the corresponding 
*_mem_space, so do we really want to make it a sparse table?
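
For reference, I mean something along these lines (an abridged sketch,
not the exact libgomp source):

  static const omp_memspace_handle_t predefined_alloc_mapping[] = {
    omp_default_mem_space,   /* omp_default_mem_alloc = 1  */
    omp_large_cap_mem_space, /* omp_large_cap_mem_alloc = 2  */
    /* ...  */
    omp_low_lat_mem_space,   /* omp_thread_mem_alloc = 8  */
  };

Starting the ompx_ values at 100 would mean dummy entries (or an offset)
for the unused slots in between.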



We were also pondering whether it should be ompx_gnu_pinned_mem_alloc or
ompx_pinned_mem_alloc.


It's a long time ago now, and I'm struggling to remember, but I think 
those names were agreed with some other parties (can't remember who 
though, and I may be thinking of the ompx_unified_shared_mem_alloc that 
is still to come).



The only other compiler supporting this flag seems to be IBM; their
compiler uses ompx_pinned_mem_alloc with the same meaning:
https://www.ibm.com/support/pages/system/files/inline-files/OMP5_User_Reference.pdf
(page 5)

As the obvious meaning is what both compilers have, I am fine without
the "gnu" infix, which Jakub accepted.


Good.



* * *

And you have not updated the compiler itself to support more this new
allocator. Cf.

https://github.com/gcc-mirror/gcc/blob/master/gcc/testsuite/c-c++-common/gomp/allocate-9.c#L23-L28

That file gives an overview what needs to be changed:

* The check functions mentioned there (seemingly for two ranges now)

* Update the OMP_ALLOCATOR env var parser in env.c

* That linked testcase (and possibly some some more) should be updated,
also to ensure that the new allocator is accepted + to check for new
unsupported values (99, 101 ?)

If we now leave gaps, the const_assert in libgomp/allocator.c probably
needs to be updated as well.

* * *

Glancing through the patches, for test cases, I think you should
'abort()' in CHECK_SIZE if it fails (rlimit issue or unsupported
system).  Or do you think that the results could still make sense
when continuing and possibly failing later?


Those were not meant to be part of the test, really, but rather to 
demystify failures for future maintainers.




Tobias


Thanks for the review.

Andrew


Re: [PATCH v3 08/11] aarch64: Generalize writeback ldp/stp patterns

2023-12-12 Thread Richard Sandiford
Alex Coplan  writes:
> Hi,
>
> This is a v3 patch which is rebased on top of the SME changes.
> Otherwise it is the same as v2, posted here:
>
> https://gcc.gnu.org/pipermail/gcc-patches/2023-December/639367.html
>
> Bootstrapped/regtested as a series on aarch64-linux-gnu, OK for trunk?
>
> Thanks,
> Alex
>
> -- >8 --
>
> Thus far the writeback forms of ldp/stp have been exclusively used in
> prologue and epilogue code for saving/restoring of registers to/from the
> stack.
>
> As such, forms of ldp/stp that weren't needed for prologue/epilogue code
> weren't supported by the aarch64 backend.  This patch generalizes the
> load/store pair writeback patterns to allow:
>
>  - Base registers other than the stack pointer.
>  - Modes that weren't previously supported.
>  - Combinations of distinct modes provided they have the same size.
>  - Pre/post variants that weren't previously needed in prologue/epilogue
>code.
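
For illustration, the writeback addressing forms in question look like
this (standard AArch64 syntax; the register choices are arbitrary):

  ldp     x2, x3, [x1], #16     // post-index: load pair, then x1 += 16
  ldp     x2, x3, [x1, #16]!    // pre-index: x1 += 16, then load pair
  stp     q0, q1, [x0, #32]!    // pre-index store of a q-register pair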
>
> We make quite some effort to avoid a combinatorial explosion in the
> number of patterns generated (and those in the source) by making
> extensive use of special predicates.
>
> An updated version of the upcoming ldp/stp pass can generate the
> writeback forms, so this patch is motivated by that.
>
> This patch doesn't add zero-extending or sign-extending forms of the
> writeback patterns; that is left for future work.
>
> gcc/ChangeLog:
>
> * config/aarch64/aarch64-protos.h (aarch64_ldpstp_operand_mode_p): 
> Declare.
> * config/aarch64/aarch64.cc (aarch64_gen_storewb_pair): Build RTL
> directly instead of invoking named pattern.
> (aarch64_gen_loadwb_pair): Likewise.
> (aarch64_ldpstp_operand_mode_p): New.
> * config/aarch64/aarch64.md (loadwb_pair_): Replace 
> with
> ...
> (*loadwb_post_pair_): ... this. Generalize as described
> in cover letter.
> (loadwb_pair_): Delete (superseded by the
> above).
> (*loadwb_post_pair_16): New.
> (*loadwb_pre_pair_): New.
> (loadwb_pair_): Delete.
> (*loadwb_pre_pair_16): New.
> (storewb_pair_): Replace with ...
> (*storewb_pre_pair_): ... this.  Generalize as
> described in cover letter.
> (*storewb_pre_pair_16): New.
> (storewb_pair_): Delete.
> (*storewb_post_pair_): New.
> (storewb_pair_): Delete.
> (*storewb_post_pair_16): New.
> * config/aarch64/predicates.md (aarch64_mem_pair_operator): New.
> (pmode_plus_operator): New.
> (aarch64_ldp_reg_operand): New.
> (aarch64_stp_reg_operand): New.

OK, thanks, although:

> +;; q-register variant of the above
> +(define_insn "*loadwb_pre_pair_16"
> +  [(set (match_operand 0 "pmode_register_operand" "=")
> + (match_operator 8 "pmode_plus_operator" [
> +   (match_operand 1 "pmode_register_operand" "0")
> +   (match_operand 4 "const_int_operand")]))
> +   (set (match_operand:TI 2 "aarch64_ldp_reg_operand" "=w")
> + (match_operator 6 "memory_operand" [
> +   (match_operator 10 "pmode_plus_operator" [
> + (match_dup 1)
> + (match_dup 4)
> +   ])]))
> +   (set (match_operand:TI 3 "aarch64_ldp_reg_operand" "=w")
> + (match_operator 7 "memory_operand" [
> +   (match_operator 9 "pmode_plus_operator" [
> +  (match_dup 1)
> +  (match_operand 5 "const_int_operand")
> +   ])]))]
> +  "TARGET_FLOAT
> +   && aarch64_mem_pair_offset (operands[4], TImode)
> +   && known_eq (INTVAL (operands[5]), INTVAL (operands[4]) + 16)"
> +  "ldp\t%q2, %q3, [%0, %4]!"
>[(set_attr "type" "neon_ldp_q")]

...I think this reads more naturally with the numbering of 9 and 10 swapped.
OK either way.

Sorry for causing the rebase to be necessary.

Richard


Re: [PATCH v2 09/11] aarch64: Rewrite non-writeback ldp/stp patterns

2023-12-12 Thread Richard Sandiford
Alex Coplan  writes:
> Hi,
>
> This is a v2 version which addresses feedback from Richard's review
> here:
>
> https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637648.html
>
> I'll reply inline to address specific comments.
>
> Bootstrapped/regtested on aarch64-linux-gnu, OK for trunk?
>
> Thanks,
> Alex
>
> -- >8 --
>
> This patch overhauls the load/store pair patterns with two main goals:
>
> 1. Fixing a correctness issue (the current patterns are not RA-friendly).
> 2. Allowing more flexibility in which operand modes are supported, and which
>combinations of modes are allowed in the two arms of the load/store pair,
>while reducing the number of patterns required both in the source and in
>the generated code.
>
> The correctness issue (1) is due to the fact that the current patterns have
> two independent memory operands tied together only by a predicate on the 
> insns.
> Since LRA only looks at the constraints, one of the memory operands can get
> reloaded without the other one being changed, leading to the insn becoming
> unrecognizable after reload.
>
> We fix this issue by changing the patterns such that they only ever have one
> memory operand representing the entire pair.  For the store case, we use an
> unspec to logically concatenate the register operands before storing them.
> For the load case, we use unspecs to extract the "lanes" from the pair mem,
> with the second occurrence of the mem matched using a match_dup (such
> that there is still really only one memory operand as far as the RA is
> concerned).
>
> In terms of the modes used for the pair memory operands, we canonicalize
> these to V2x4QImode, V2x8QImode, and V2x16QImode.  These modes have not
> only the correct size but also correct alignment requirement for a
> memory operand representing an entire load/store pair.  Unlike the other
> two, V2x4QImode didn't previously exist, so had to be added with the
> patch.
>
> As with the previous patch generalizing the writeback patterns, this
> patch aims to be flexible in the combinations of modes supported by the
> patterns without requiring a large number of generated patterns by using
> distinct mode iterators.
>
> The new scheme means we only need a single (generated) pattern for each
> load/store operation of a given operand size.  For the 4-byte and 8-byte
> operand cases, we use the GPI iterator to synthesize the two patterns.
> The 16-byte case is implemented as a separate pattern in the source (due
> to only having a single possible alternative).
>
> Since the UNSPEC patterns can't be interpreted by the dwarf2cfi code,
> we add REG_CFA_OFFSET notes to the store pair insns emitted by
> aarch64_save_callee_saves, so that correct CFI information can still be
> generated.  Furthermore, we now unconditionally generate these CFA
> notes on frame-related insns emitted by aarch64_save_callee_saves.
> This is done in case that the load/store pair pass forms these into
> pairs, in which case the CFA notes would be needed.
>
> We also adjust the ldp/stp peepholes to generate the new form.  This is
> done by switching the generation to use the
> aarch64_gen_{load,store}_pair interface, making it easier to change the
> form in the future if needed.  (Likewise, the upcoming aarch64
> load/store pair pass also makes use of this interface).
>
> This patch also adds an "ldpstp" attribute to the non-writeback
> load/store pair patterns, which is used by the post-RA load/store pair
> pass to identify existing patterns and see if they can be promoted to
> writeback variants.
>
> One potential concern with using unspecs for the patterns is that it can block
> optimization by the generic RTL passes.  This patch series tries to mitigate
> this in two ways:
>  1. The pre-RA load/store pair pass runs very late in the pre-RA pipeline.
>  2. A later patch in the series adjusts the aarch64 mem{cpy,set} expansion to
> emit individual loads/stores instead of ldp/stp.  These should then be
> formed back into load/store pairs much later in the RTL pipeline by the
> new load/store pair pass.
>
> gcc/ChangeLog:
>
> * config/aarch64/aarch64-ldpstp.md: Abstract ldp/stp
> representation from peepholes, allowing use of new form.
> * config/aarch64/aarch64-modes.def (V2x4QImode): Define.
> * config/aarch64/aarch64-protos.h
> (aarch64_finish_ldpstp_peephole): Declare.
> (aarch64_swap_ldrstr_operands): Delete declaration.
> (aarch64_gen_load_pair): Adjust parameters.
> (aarch64_gen_store_pair): Likewise.
> * config/aarch64/aarch64-simd.md (load_pair):
> Delete.
> (vec_store_pair): Delete.
> (load_pair): Delete.
> (vec_store_pair): Delete.
> * config/aarch64/aarch64.cc (aarch64_pair_mode_for_mode): New.
> (aarch64_gen_store_pair): Adjust to use new unspec form of stp.
> Drop second mem from parameters.
> (aarch64_gen_load_pair): Likewise.
> 

Re: [PATCH] expmed: Perform mask extraction via QImode [PR112773].

2023-12-12 Thread Richard Sandiford
Robin Dapp  writes:
>> - Change the second mode to vec_extract_optab.  This is only a name
>>   lookup, and it seems more natural to continue using the real element mode.
>
> Am I understanding correctly that this implies we should provide
> a vec_extractbi expander?  (with the innermode being BImode
> here).

Yeah, I think so.  That way the interpretation of the mode stays in
sync with the interpretation of the bit position.  If instead we used
QImode with a bitnum of , the top 7 bits of the
read would logically be out of bounds.

Thanks,
Richard


Re: GCC/Rust libgrust-v2/to-submit branch

2023-12-12 Thread Thomas Schwinge
Hi Arthur, Pierre-Emmanuel!

On 2023-12-12T10:39:50+0100, I wrote:
> On 2023-11-27T16:46:08+0100, I wrote:
>> On 2023-11-21T16:20:22+0100, Arthur Cohen  wrote:
>>> On 11/20/23 15:55, Thomas Schwinge wrote:
 Arthur and Pierre-Emmanuel have prepared a GCC/Rust libgrust-v2/to-submit
 branch: .

> Rebasing onto current master branch, there's a minor (textual) conflict
> in top-level 'configure.ac:host_libs': 'intl' replaced by 'gettext', and
> top-level 'configure' plus 'gcc/configure' have to be re-generated (the
> latter for some unrelated changes in line numbers).  Otherwise, those
> initial libgrust changes are now in the form that I thought they should
> be in -- so I suggest you fix that up (I can quickly have a look again,
> if you like)

I've noticed that you've fixed that up (looks good), but I also noticed one
additional small item: into "build: Add libgrust as compilation modules",
you'll have to add the effect of top-level 'autogen Makefile.def' (that
is, regenerate the top-level 'Makefile.in').


Grüße
 Thomas


> and then you do the "scary" 'git push' ;-) -- and then:
>
>>> All the best, and thanks again for testing :)
>>
>> :-) So I hope I've not missed any major issues...
>
> ..., we wait and see.  :-)
>
>
> Grüße
>  Thomas
-
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 
München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas 
Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht 
München, HRB 106955


Re: [PATCH] c++: End lifetime of objects in constexpr after destructor call [PR71093]

2023-12-12 Thread Jason Merrill

On 12/12/23 06:15, Jakub Jelinek wrote:

On Tue, Dec 12, 2023 at 02:13:43PM +0300, Alexander Monakov wrote:



On Tue, 12 Dec 2023, Jakub Jelinek wrote:


On Mon, Dec 11, 2023 at 05:00:50PM -0500, Jason Merrill wrote:

In discussion of PR71093 it came up that more clobber_kind options would be
useful within the C++ front-end.

gcc/ChangeLog:

* tree-core.h (enum clobber_kind): Rename CLOBBER_EOL to
CLOBBER_STORAGE_END.  Add CLOBBER_STORAGE_BEGIN,
CLOBBER_OBJECT_BEGIN, CLOBBER_OBJECT_END.
* gimple-lower-bitint.cc
* gimple-ssa-warn-access.cc
* gimplify.cc
* tree-inline.cc
* tree-ssa-ccp.cc: Adjust for rename.


Doesn't build_clobber_this in the C++ front-end need to be adjusted too?
I think it is used to place clobbers at start of the ctor (should be
CLOBBER_OBJECT_BEGIN in the new nomenclature) and end of the dtor (i.e.
CLOBBER_OBJECT_END).


You're right.


I had been thinking to leave that to Nathaniel's patch, but sure, I'll 
hoist those bits out:
From 29b54f1e2a832f74bdbdba738991d3330b0b4577 Mon Sep 17 00:00:00 2001
From: Jason Merrill 
Date: Mon, 11 Dec 2023 11:35:31 -0500
Subject: [PATCH] tree: add to clobber_kind
To: gcc-patches@gcc.gnu.org

In discussion of PR71093 it came up that more clobber_kind options would be
useful within the C++ front-end.

gcc/ChangeLog:

	* tree-core.h (enum clobber_kind): Rename CLOBBER_EOL to
	CLOBBER_STORAGE_END.  Add CLOBBER_STORAGE_BEGIN,
	CLOBBER_OBJECT_BEGIN, CLOBBER_OBJECT_END.
	* gimple-lower-bitint.cc
	* gimple-ssa-warn-access.cc
	* gimplify.cc
	* tree-inline.cc
	* tree-ssa-ccp.cc: Adjust for rename.
	* tree-pretty-print.cc: And handle new values.

gcc/cp/ChangeLog:

	* call.cc (build_trivial_dtor_call): Use CLOBBER_OBJECT_END.
	* decl.cc (build_clobber_this): Take clobber_kind argument.
	(start_preparsed_function): Pass CLOBBER_OBJECT_BEGIN.
	(begin_destructor_body): Pass CLOBBER_OBJECT_END.

gcc/testsuite/ChangeLog:

	* gcc.dg/pr87052.c: Adjust expected CLOBBER output.

Co-authored-by: Nathaniel Shead  
---
 gcc/tree-core.h| 13 ++---
 gcc/cp/call.cc |  2 +-
 gcc/cp/decl.cc |  9 +
 gcc/gimple-lower-bitint.cc | 10 ++
 gcc/gimple-ssa-warn-access.cc  |  2 +-
 gcc/gimplify.cc|  9 +
 gcc/testsuite/gcc.dg/pr87052.c |  4 ++--
 gcc/tree-inline.cc |  6 --
 gcc/tree-pretty-print.cc   | 19 +--
 gcc/tree-ssa-ccp.cc|  2 +-
 10 files changed, 52 insertions(+), 24 deletions(-)

diff --git a/gcc/tree-core.h b/gcc/tree-core.h
index 04c04cf2f37..58aa598f3bb 100644
--- a/gcc/tree-core.h
+++ b/gcc/tree-core.h
@@ -986,12 +986,19 @@ enum annot_expr_kind {
   annot_expr_kind_last
 };
 
-/* The kind of a TREE_CLOBBER_P CONSTRUCTOR node.  */
+/* The kind of a TREE_CLOBBER_P CONSTRUCTOR node.  Other than _UNDEF, these are
+   in roughly sequential order.  */
 enum clobber_kind {
   /* Unspecified, this clobber acts as a store of an undefined value.  */
   CLOBBER_UNDEF,
-  /* This clobber ends the lifetime of the storage.  */
-  CLOBBER_EOL,
+  /* Beginning of storage duration, e.g. malloc.  */
+  CLOBBER_STORAGE_BEGIN,
+  /* Beginning of object lifetime, e.g. C++ constructor.  */
+  CLOBBER_OBJECT_BEGIN,
+  /* End of object lifetime, e.g. C++ destructor.  */
+  CLOBBER_OBJECT_END,
+  /* End of storage duration, e.g. free.  */
+  CLOBBER_STORAGE_END,
   CLOBBER_LAST
 };
 
diff --git a/gcc/cp/call.cc b/gcc/cp/call.cc
index 4f0abf8e93f..aaee34f35b0 100644
--- a/gcc/cp/call.cc
+++ b/gcc/cp/call.cc
@@ -9716,7 +9716,7 @@ build_trivial_dtor_call (tree instance, bool no_ptr_deref)
 }
 
   /* A trivial destructor should still clobber the object.  */
-  tree clobber = build_clobber (TREE_TYPE (instance));
+  tree clobber = build_clobber (TREE_TYPE (instance), CLOBBER_OBJECT_END);
   return build2 (MODIFY_EXPR, void_type_node,
 		 instance, clobber);
 }
diff --git a/gcc/cp/decl.cc b/gcc/cp/decl.cc
index b1ada1d5215..4d17ead123a 100644
--- a/gcc/cp/decl.cc
+++ b/gcc/cp/decl.cc
@@ -17401,7 +17401,7 @@ implicit_default_ctor_p (tree fn)
storage is dead when we enter the constructor or leave the destructor.  */
 
 static tree
-build_clobber_this ()
+build_clobber_this (clobber_kind kind)
 {
   /* Clobbering an empty base is pointless, and harmful if its one byte
  TYPE_SIZE overlays real data.  */
@@ -17417,7 +17417,7 @@ build_clobber_this ()
   if (!vbases)
 ctype = CLASSTYPE_AS_BASE (ctype);
 
-  tree clobber = build_clobber (ctype);
+  tree clobber = build_clobber (ctype, kind);
 
   tree thisref = current_class_ref;
   if (ctype != current_class_type)
@@ -17836,7 +17836,7 @@ start_preparsed_function (tree decl1, tree attrs, int flags)
 	 because part of the initialization might happen before we enter the
 	 constructor, via AGGR_INIT_ZERO_FIRST (c++/68006).  */
   && !implicit_default_ctor_p (decl1))
-finish_expr_stmt (build_clobber_this ());
+finish_expr_stmt 

[PATCH v3] aarch64,arm: Move branch-protection data to targets

2023-12-12 Thread Szabolcs Nagy
The branch-protection types are target specific, not the same on arm
and aarch64.  This currently affects pac-ret+b-key, but there will be
a new type on aarch64 that is not relevant for arm.

After the move, change aarch_ identifiers to aarch64_ or arm_ as
appropriate.

gcc/ChangeLog:

* config/aarch64/aarch64.md: Rename aarch_ to aarch64_.
* config/aarch64/aarch64.opt: Likewise.
* config/aarch64/aarch64-c.cc (aarch64_update_cpp_builtins): Likewise.
* config/aarch64/aarch64.cc (aarch64_expand_prologue): Likewise.
(aarch64_expand_epilogue): Likewise.
(aarch64_post_cfi_startproc): Likewise.
(aarch64_handle_no_branch_protection): Copy and rename.
(aarch64_handle_standard_branch_protection): Likewise.
(aarch64_handle_pac_ret_protection): Likewise.
(aarch64_handle_pac_ret_leaf): Likewise.
(aarch64_handle_pac_ret_b_key): Likewise.
(aarch64_handle_bti_protection): Likewise.
* config/arm/aarch-common.cc (aarch_handle_no_branch_protection):
Remove.
(aarch_handle_standard_branch_protection): Remove.
(aarch_handle_pac_ret_protection): Remove.
(aarch_handle_pac_ret_leaf): Remove.
(aarch_handle_pac_ret_b_key): Remove.
(aarch_handle_bti_protection): Remove.
* config/arm/aarch-common.h (enum aarch_key_type): Remove.
(struct aarch_branch_protect_type): Declare.
* config/arm/arm-c.cc (arm_cpu_builtins): Remove aarch_ra_sign_key.
* config/arm/arm.cc (arm_handle_no_branch_protection): Copy and rename.
(arm_handle_standard_branch_protection): Likewise.
(arm_handle_pac_ret_protection): Likewise.
(arm_handle_pac_ret_leaf): Likewise.
(arm_handle_bti_protection): Likewise.
(arm_configure_build_target): Likewise.
* config/arm/arm.opt: Remove aarch_ra_sign_key.
---
v3: aarch_ to aarch64_/arm_ renames.
---
 gcc/config/aarch64/aarch64-c.cc |  4 +-
 gcc/config/aarch64/aarch64.cc   | 69 +
 gcc/config/aarch64/aarch64.md   |  2 +-
 gcc/config/aarch64/aarch64.opt  |  2 +-
 gcc/config/arm/aarch-common.cc  | 55 --
 gcc/config/arm/aarch-common.h   | 11 +++---
 gcc/config/arm/arm-c.cc |  2 -
 gcc/config/arm/arm.cc   | 52 ++---
 gcc/config/arm/arm.opt  |  3 --
 9 files changed, 117 insertions(+), 83 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-c.cc b/gcc/config/aarch64/aarch64-c.cc
index 115a2a8b756..553c99845e2 100644
--- a/gcc/config/aarch64/aarch64-c.cc
+++ b/gcc/config/aarch64/aarch64-c.cc
@@ -235,9 +235,9 @@ aarch64_update_cpp_builtins (cpp_reader *pfile)
   if (aarch_ra_sign_scope != AARCH_FUNCTION_NONE)
 {
   int v = 0;
-  if (aarch_ra_sign_key == AARCH_KEY_A)
+  if (aarch64_ra_sign_key == AARCH64_KEY_A)
v |= 1;
-  if (aarch_ra_sign_key == AARCH_KEY_B)
+  if (aarch64_ra_sign_key == AARCH64_KEY_B)
v |= 2;
   if (aarch_ra_sign_scope == AARCH_FUNCTION_ALL)
v |= 4;
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 9530618abea..dfd374c901e 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -9461,12 +9461,12 @@ aarch64_expand_prologue (void)
   /* Sign return address for functions.  */
   if (aarch64_return_address_signing_enabled ())
 {
-  switch (aarch_ra_sign_key)
+  switch (aarch64_ra_sign_key)
{
- case AARCH_KEY_A:
+ case AARCH64_KEY_A:
insn = emit_insn (gen_paciasp ());
break;
- case AARCH_KEY_B:
+ case AARCH64_KEY_B:
insn = emit_insn (gen_pacibsp ());
break;
  default:
@@ -9880,12 +9880,12 @@ aarch64_expand_epilogue (rtx_call_insn *sibcall)
   if (aarch64_return_address_signing_enabled ()
   && (sibcall || !TARGET_ARMV8_3))
 {
-  switch (aarch_ra_sign_key)
+  switch (aarch64_ra_sign_key)
{
- case AARCH_KEY_A:
+ case AARCH64_KEY_A:
insn = emit_insn (gen_autiasp ());
break;
- case AARCH_KEY_B:
+ case AARCH64_KEY_B:
insn = emit_insn (gen_autibsp ());
break;
  default:
@@ -18541,6 +18541,61 @@ aarch64_set_asm_isa_flags (aarch64_feature_flags flags)
   aarch64_set_asm_isa_flags (_options, flags);
 }
 
+static void
+aarch64_handle_no_branch_protection (void)
+{
+  aarch_ra_sign_scope = AARCH_FUNCTION_NONE;
+  aarch_enable_bti = 0;
+}
+
+static void
+aarch64_handle_standard_branch_protection (void)
+{
+  aarch_ra_sign_scope = AARCH_FUNCTION_NON_LEAF;
+  aarch64_ra_sign_key = AARCH64_KEY_A;
+  aarch_enable_bti = 1;
+}
+
+static void
+aarch64_handle_pac_ret_protection (void)
+{
+  aarch_ra_sign_scope = AARCH_FUNCTION_NON_LEAF;
+  aarch64_ra_sign_key = AARCH64_KEY_A;
+}
+
+static void
+aarch64_handle_pac_ret_leaf (void)
+{
+  aarch_ra_sign_scope = AARCH_FUNCTION_ALL;
+}
+
+static 

Re: [PATCH] expmed: Perform mask extraction via QImode [PR112773].

2023-12-12 Thread Robin Dapp
> - Change the second mode to vec_extract_optab.  This is only a name
>   lookup, and it seems more natural to continue using the real element mode.

Am I understanding correctly that this implies we should provide
a vec_extractbi expander?  (with the innermode being BImode
here).

Regards
 Robin


Re: Disable FMADD in chains for Zen4 and generic

2023-12-12 Thread Richard Biener
On Tue, Dec 12, 2023 at 3:38 PM Jan Hubicka  wrote:
>
> Hi,
> this patch disables use of FMA in matrix multiplication loop for generic (for
> x86-64-v3) and zen4.  I tested this on zen4 and Xeon Gold 6212U.
>
> For Intel this is neutral both on the matrix multiplication microbenchmark
> (attached) and spec2k17 where the difference was within noise for Core.
>
> On core the micro-benchmark runs as follows:
>
> With FMA:
>
>578,500,241  cycles:u #3.645 GHz   
>   ( +-  0.12% )
>753,318,477  instructions:u   #1.30  insn per 
> cycle  ( +-  0.00% )
>125,417,701  branches:u   #  790.227 M/sec 
>   ( +-  0.00% )
>   0.159146 +- 0.000363 seconds time elapsed  ( +-  0.23% )
>
>
> No FMA:
>
>577,573,960  cycles:u #3.514 GHz   
>   ( +-  0.15% )
>878,318,479  instructions:u   #1.52  insn per 
> cycle  ( +-  0.00% )
>125,417,702  branches:u   #  763.035 M/sec 
>   ( +-  0.00% )
>   0.164734 +- 0.000321 seconds time elapsed  ( +-  0.19% )
>
> So the cycle count is unchanged and discrete multiply+add takes same time as 
> FMA.
>
> While on zen:
>
>
> With FMA:
>  484875179  cycles:u #3.599 GHz   
>( +-  0.05% )  (82.11%)
>  752031517  instructions:u   #1.55  insn per 
> cycle
>  125106525  branches:u   #  928.712 M/sec 
>( +-  0.03% )  (85.09%)
> 128356  branch-misses:u  #0.10% of all 
> branches  ( +-  0.06% )  (83.58%)
>
> No FMA:
>  375875209  cycles:u #3.592 GHz   
>( +-  0.08% )  (80.74%)
>  875725341  instructions:u   #2.33  insn per 
> cycle
>  124903825  branches:u   #1.194 G/sec 
>( +-  0.04% )  (84.59%)
>   0.105203 +- 0.000188 seconds time elapsed  ( +-  0.18% )
>
> The difference is that Cores understand the fact that fmadd does not need
> all three parameters to start computation, while Zen cores don't.

This came up in a separate thread as well, but when doing reassoc of a
chain with multiple dependent FMAs.

I can't understand how this uarch detail can affect performance when,
as in the testcase, the longest input latency is on the multiplication
from a memory load.  Do we actually understand _why_ the FMAs are
slower here?

Do we know that Cores can start the multiplication part when the add
operand isn't
ready yet?  I'm curious how you set up a micro benchmark to measure this.

There's one detail on Zen in that it can issue 2 FADDs and 2 FMUL/FMA
per cycle.  So in theory we can at most do 2 FMA per cycle, but with
latency (FMA) == 4 for Zen3/4 and latency (FADD/FMUL) == 3 we might be
able to squeeze out a little bit more throughput when there are many
FADD/FMUL ops to execute?  That works independently of whether FMAs
have a head-start on multiplication, as you'd still be bottlenecked on
the 2-wide issue for FMA?
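
One way to probe the head-start question directly is a serial chain
through the FMA addend; a sketch (illustrative only, not the benchmark
used above; build with -O2 -mfma, and with -ffp-contract=off for the
no-FMA variant so the mul+add is not re-contracted into an fma):

  #include <stdio.h>
  #include <time.h>

  #define N 200000000L

  int main (void)
  {
    volatile float vs = 1.0000001f;  /* defeat constant folding */
    float acc = vs, m = vs;
    clock_t s = clock ();
    for (long i = 0; i < N; i++)
  #ifdef USE_FMA
      acc = __builtin_fmaf (m, m, acc); /* chain runs through the addend */
  #else
      acc = m * m + acc;  /* m*m is invariant, so this is a pure FADD chain */
  #endif
    clock_t e = clock ();
    printf ("acc=%f took %d clocks\n", acc, (int) (e - s));
    return 0;
  }

If the FMA variant is no slower than the plain FADD chain, that would
suggest the core starts the multiply before the addend arrives.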

On Icelake it seems all FADD/FMUL/FMA share ports 0 and 1 and all have a latency
of four.  So you should get worse results there (looking at the numbers
above you do get worse results, slightly so), probably the higher
number of uops is hidden by the latency.

> Since this seems a noticeable win on zen and not a loss on Core it seems
> like a good default for generic.
>
> I plan to commit the patch next week if there are no complaints.

complaint!

Richard.

> Honza
>
> #include <stdio.h>
> #include <time.h>
>
> #define SIZE 1000
>
> float a[SIZE][SIZE];
> float b[SIZE][SIZE];
> float c[SIZE][SIZE];
>
> void init(void)
> {
>    int i, j, k;
>    for(i=0; i<SIZE; i++)
>    {
>       for(j=0; j<SIZE; j++)
>       {
>          a[i][j] = (float)i + j;
>          b[i][j] = (float)i - j;
>          c[i][j] = 0.0f;
>       }
>    }
> }
>
> void mult(void)
> {
>    int i, j, k;
>
>    for(i=0; i<SIZE; i++)
>    {
>       for(j=0; j<SIZE; j++)
>       {
>          for(k=0; k<SIZE; k++)
>          {
>             c[i][j] += a[i][k] * b[k][j];
>          }
>       }
>    }
> }
>
> int main(void)
> {
>clock_t s, e;
>
>init();
>s=clock();
>mult();
>e=clock();
>printf("mult took %10d clocks\n", (int)(e-s));
>
>return 0;
>
> }
>
> * config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS,
> X86_TUNE_AVOID_256FMA_CHAINS): Enable for znver4 and Core.
>
> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> index 43fa9e8fd6d..74b03cbcc60 100644
> --- a/gcc/config/i386/x86-tune.def
> +++ b/gcc/config/i386/x86-tune.def
> @@ -515,13 +515,13 @@ DEF_TUNE (X86_TUNE_USE_SCATTER_8PARTS, 
> "use_scatter_8parts",
>
>  /* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops 

Re: [PATCH] multiflags: fix doc warning properly

2023-12-12 Thread Joseph Myers
On Mon, 11 Dec 2023, Alexandre Oliva wrote:

> On Dec 11, 2023, Joseph Myers  wrote:
> 
> > On Fri, 8 Dec 2023, Alexandre Oliva wrote:
> >> @@ -20589,7 +20589,7 @@ allocation before or after interprocedural 
> >> optimization.
> >> This option enables multilib-aware @code{TFLAGS} to be used to build
> >> target libraries with options different from those the compiler is
> >> configured to use by default, through the use of specs (@xref{Spec
> >> -Files}) set up by compiler internals, by the target, or by builders at
> >> +Files}.) set up by compiler internals, by the target, or by builders at
> 
> > The proper change in this context is to use @pxref instead of @xref.
> 
> Oooh, nice!  Thank you!
> 
> Here's a presumably proper fix on top of the earlier one, then.  Tested
> on x86_64-linux-gnu.  Ok to install?
> 
> 
> Rather than a dubious fix for a dubious warning, namely adding a
> period after a parenthesized @xref because the warning demands it, use
> @pxref that is meant for exactly this case.  Thanks to Joseph Myers
> for introducing me to it.

OK.

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: [PATCH] expmed: Perform mask extraction via QImode [PR112773].

2023-12-12 Thread Richard Sandiford
Robin Dapp  writes:
> What also works is something like:
>
>   scalar_mode extract_mode = innermode;
>   if (GET_MODE_CLASS (outermode) == MODE_VECTOR_BOOL)
>   extract_mode = smallest_int_mode_for_size
> (GET_MODE_PRECISION (innermode));
>
> however
>
>> So yes, I guess we need to answer BImode vs. QImode.  I hope Richard
>> has a better idea here?
>
> aarch64's predicate vec_extract is:
>
> (define_expand "vec_extract"
>   [(match_operand: 0 "register_operand")
>(match_operand: 1 "register_operand")
>(match_operand:SI 2 "nonmemory_operand")
>;; Dummy operand to which we can attach the iterator.
>(reg:SVE_FULL_I V0_REGNUM)]
>
> So if I'm reading this correctly they are using the element
> mode of the associated full vector mode for extraction rather
> than QImode.
>
> I could also do something similar for the riscv backend but
> that still wouldn't yield a BImode vec_extract result of course
> and expmed would need to be adjusted.  Do we even know the
> original associated non-predicate mode here?  I suppose not?
>
> Do we need a mov from/to BImode instead?
>
> Maybe Richard has a good idea.
>
> Even though I haven't seen it being hit, vec_set in expmed
> would have the same problem?

The patch seemed to be doing three things:

- Use GET_MODE_PRECISION instead of GET_MODE_BITSIZE.  I agree that this
  makes sense on the face of it.

- Change the second mode to vec_extract_optab.  This is only a name
  lookup, and it seems more natural to continue using the real element mode.

- Change the mode of the output operand.  Here we could use
  insn_data[icode].operand[0].mode instead of innermode.

Thanks,
Richard




Re: PING^1 [PATCH] range: Workaround different type precision issue between _Float128 and long double [PR112788]

2023-12-12 Thread Jakub Jelinek
On Tue, Dec 12, 2023 at 09:33:38AM -0500, Andrew MacLeod wrote:
> I leave this for the release managers, but I am not opposed to it for this
> release... It would be nice to remove it for the next release

I can live with it for GCC 14, so ok, but it is very ugly.

We should fix it in a better way for GCC 15+.
I think we shouldn't lie, both on the mode precisions and on type
precisions.  The middle-end already contains some hacks to make it
work to some extent on 2 different modes with same precision (for BFmode vs.
HFmode), on the FE side if we need a target hook the C/C++ FE will use
to choose type ranks and/or the type for binary operations, so be it.
It would be also great if rs6000 backend had just 2 modes for 128-bit
floats, one for IBM double double, one for IEEE quad, not 3 as it has now,
perhaps with TFmode being a macro that conditionally expands to one or the
other.  Or do some tweaks in target hooks to keep backwards compatibility
with mode attribute and similar.

Jakub



Disable FMADD in chains for Zen4 and generic

2023-12-12 Thread Jan Hubicka
Hi,
this patch disables use of FMA in matrix multiplication loop for generic (for
x86-64-v3) and zen4.  I tested this on zen4 and Xeon Gold 6212U.

For Intel this is neutral both on the matrix multiplication microbenchmark
(attached) and spec2k17 where the difference was within noise for Core.

On core the micro-benchmark runs as follows:

With FMA:

   578,500,241  cycles:u #3.645 GHz 
( +-  0.12% )
   753,318,477  instructions:u   #1.30  insn per 
cycle  ( +-  0.00% )
   125,417,701  branches:u   #  790.227 M/sec   
( +-  0.00% )
  0.159146 +- 0.000363 seconds time elapsed  ( +-  0.23% )


No FMA:

   577,573,960  cycles:u #3.514 GHz 
( +-  0.15% )
   878,318,479  instructions:u   #1.52  insn per 
cycle  ( +-  0.00% )
   125,417,702  branches:u   #  763.035 M/sec   
( +-  0.00% )
  0.164734 +- 0.000321 seconds time elapsed  ( +-  0.19% )

So the cycle count is unchanged and discrete multiply+add takes same time as 
FMA.

While on zen:


With FMA:
 484875179  cycles:u #3.599 GHz 
 ( +-  0.05% )  (82.11%)
 752031517  instructions:u   #1.55  insn per 
cycle 
 125106525  branches:u   #  928.712 M/sec   
 ( +-  0.03% )  (85.09%)
128356  branch-misses:u  #0.10% of all 
branches  ( +-  0.06% )  (83.58%)

No FMA:
 375875209  cycles:u #3.592 GHz 
 ( +-  0.08% )  (80.74%)
 875725341  instructions:u   #2.33  insn per 
cycle
 124903825  branches:u   #1.194 G/sec   
 ( +-  0.04% )  (84.59%)
  0.105203 +- 0.000188 seconds time elapsed  ( +-  0.18% )

The difference is that Cores understand the fact that fmadd does not need
all three parameters to start computation, while Zen cores don't.

Since this seems a noticeable win on zen and not a loss on Core it seems
like a good default for generic.

I plan to commit the patch next week if there are no complaints.

Honza

#include <stdio.h>
#include <time.h>

#define SIZE 1000

float a[SIZE][SIZE];
float b[SIZE][SIZE];
float c[SIZE][SIZE];

void init(void)
{
   int i, j, k;
   for(i=0; i

Re: PING^1 [PATCH] range: Workaround different type precision issue between _Float128 and long double [PR112788]

2023-12-12 Thread Andrew MacLeod
I leave this for the release managers, but I am not opposed to it for 
this release... It would be nice to remove it for the next release


Andrew



On 12/12/23 01:07, Kewen.Lin wrote:

Hi,

Gentle ping this:

https://gcc.gnu.org/pipermail/gcc-patches/2023-December/639140.html

BR,
Kewen

on 2023/12/4 17:49, Kewen.Lin wrote:

Hi,

As PR112788 shows, on rs6000 with -mabi=ieeelongdouble type _Float128
has a different type precision (128) from that (127) of type long
double, but they actually have the same underlying mode, so they have
the same precision, as the mode indicates the same real type format
ieee_quad_format.

It's not sensible to have two such types which have the same mode but
different type precisions; a fix attempt was posted at [1].
As discussed there, there are some historical reasons and
practical issues.  Considering we have passed stage 1 and it also
affected the build as reported, this patch tries to work around it
temporarily.  I thought of introducing a hook, but that seems a bit
overkill; assuming that scalar float types with the same mode have
the same precision looks sensible.

Bootstrapped and regtested on powerpc64-linux-gnu P7/P8/P9 and
powerpc64le-linux-gnu P9/P10.

Is it ok for trunk?

[1] 
https://inbox.sourceware.org/gcc-patches/718677e7-614d-7977-312d-05a75e1fd...@linux.ibm.com/

BR,
Kewen

PR tree-optimization/112788

gcc/ChangeLog:

* value-range.h (range_compatible_p): Workaround same type mode but
different type precision issue for rs6000 scalar float types
_Float128 and long double.
---
  gcc/value-range.h | 10 --
  1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/gcc/value-range.h b/gcc/value-range.h
index 33f204a7171..d0a84754a10 100644
--- a/gcc/value-range.h
+++ b/gcc/value-range.h
@@ -1558,7 +1558,13 @@ range_compatible_p (tree type1, tree type2)
// types_compatible_p requires conversion in both directions to be useless.
// GIMPLE only requires a cast one way in order to be compatible.
// Ranges really only need the sign and precision to be the same.
-  return (TYPE_PRECISION (type1) == TYPE_PRECISION (type2)
- && TYPE_SIGN (type1) == TYPE_SIGN (type2));
+  return TYPE_SIGN (type1) == TYPE_SIGN (type2)
+&& (TYPE_PRECISION (type1) == TYPE_PRECISION (type2)
+// FIXME: As PR112788 shows, for now on rs6000 _Float128 has
+// type precision 128 while long double has type precision 127
+// but both have the same mode so their precision is actually
+// the same, workaround it temporarily.
+|| (SCALAR_FLOAT_TYPE_P (type1)
+&& TYPE_MODE (type1) == TYPE_MODE (type2)));
  }
  #endif // GCC_VALUE_RANGE_H
--
2.42.0





[PATCH] RISC-V: Apply vla vs. vls mode heuristic vector COST model

2023-12-12 Thread Juzhe-Zhong
This patch applies the vla vs. vls mode heuristic, which fixes the following FAILs:
FAIL: gcc.target/riscv/rvv/autovec/pr111751.c -O3 -ftree-vectorize
scan-assembler-not vset
FAIL: gcc.target/riscv/rvv/autovec/pr111751.c -O3 -ftree-vectorize
scan-assembler-times li\\s+[a-x0-9]+,0\\s+ret 2

The root cause of this FAIL is that we failed to pick a VLS mode for the vectorization.

Before this patch:

foo2:
        addi    sp,sp,-208
        addi    a2,sp,64
        addi    a5,sp,128
        lui     a6,%hi(.LANCHOR0)
        sd      ra,200(sp)
        addi    a6,a6,%lo(.LANCHOR0)
        mv      a0,a2
        mv      a1,a5
        li      a3,16
        mv      a4,sp
        vsetivli        zero,8,e64,m8,ta,ma
        vle64.v v8,0(a6)
        vse64.v v8,0(a2)
        vse64.v v8,0(a5)
.L4:
        vsetvli a5,a3,e32,m1,ta,ma
        slli    a2,a5,2
        vle32.v v2,0(a1)
        vle32.v v1,0(a0)
        sub     a3,a3,a5
        vadd.vv v1,v1,v2
        vse32.v v1,0(a4)
        add     a1,a1,a2
        add     a0,a0,a2
        add     a4,a4,a2
        bne     a3,zero,.L4
        lw      a4,128(sp)
        lw      a5,64(sp)
        addw    a5,a5,a4
        lw      a4,0(sp)
        bne     a4,a5,.L5
        lw      a4,132(sp)
        lw      a5,68(sp)
        addw    a5,a5,a4
        lw      a4,4(sp)
        bne     a4,a5,.L5
        lw      a4,136(sp)
        lw      a5,72(sp)
        addw    a5,a5,a4
        lw      a4,8(sp)
        bne     a4,a5,.L5
        lw      a4,140(sp)
        lw      a5,76(sp)
        addw    a5,a5,a4
        lw      a4,12(sp)
        bne     a4,a5,.L5
        lw      a4,144(sp)
        lw      a5,80(sp)
        addw    a5,a5,a4
        lw      a4,16(sp)
        bne     a4,a5,.L5
        lw      a4,148(sp)
        lw      a5,84(sp)
        addw    a5,a5,a4
        lw      a4,20(sp)
        bne     a4,a5,.L5
        lw      a4,152(sp)
        lw      a5,88(sp)
        addw    a5,a5,a4
        lw      a4,24(sp)
        bne     a4,a5,.L5
        lw      a4,156(sp)
        lw      a5,92(sp)
        addw    a5,a5,a4
        lw      a4,28(sp)
        bne     a4,a5,.L5
        lw      a4,160(sp)
        lw      a5,96(sp)
        addw    a5,a5,a4
        lw      a4,32(sp)
        bne     a4,a5,.L5
        lw      a4,164(sp)
        lw      a5,100(sp)
        addw    a5,a5,a4
        lw      a4,36(sp)
        bne     a4,a5,.L5
        lw      a4,168(sp)
        lw      a5,104(sp)
        addw    a5,a5,a4
        lw      a4,40(sp)
        bne     a4,a5,.L5
        lw      a4,172(sp)
        lw      a5,108(sp)
        addw    a5,a5,a4
        lw      a4,44(sp)
        bne     a4,a5,.L5
        lw      a4,176(sp)
        lw      a5,112(sp)
        addw    a5,a5,a4
        lw      a4,48(sp)
        bne     a4,a5,.L5
        lw      a4,180(sp)
        lw      a5,116(sp)
        addw    a5,a5,a4
        lw      a4,52(sp)
        bne     a4,a5,.L5
        lw      a4,184(sp)
        lw      a5,120(sp)
        addw    a5,a5,a4
        lw      a4,56(sp)
        bne     a4,a5,.L5
        lw      a4,188(sp)
        lw      a5,124(sp)
        addw    a5,a5,a4
        lw      a4,60(sp)
        bne     a4,a5,.L5
        ld      ra,200(sp)
        li      a0,0
        addi    sp,sp,208
        jr      ra
.L5:
        call    abort

After this patch:

li  a0,0
ret

The heuristic follows ARM SVE; it is fully tested, and we confirmed we
have the same behavior as ARM SVE GCC and RVV Clang.

gcc/ChangeLog:

* config/riscv/riscv-vector-costs.cc (costs::analyze_loop_vinfo): New 
function.
(costs::record_potential_vls_unrolling): Ditto.
(costs::prefer_unrolled_loop): Ditto.
(costs::better_main_loop_than_p): Ditto.
(costs::add_stmt_cost): Ditto.
* config/riscv/riscv-vector-costs.h (enum cost_type_enum): New enum.
* config/riscv/t-riscv: Add new include files.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/pr111313.c: Adapt test.
* gcc.target/riscv/rvv/autovec/vls/shift-3.c: Ditto.
* gcc.dg/vect/costmodel/riscv/rvv/vla_vs_vls-1.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/vla_vs_vls-10.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/vla_vs_vls-11.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/vla_vs_vls-12.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/vla_vs_vls-2.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/vla_vs_vls-3.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/vla_vs_vls-4.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/vla_vs_vls-5.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/vla_vs_vls-6.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/vla_vs_vls-7.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/vla_vs_vls-8.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/vla_vs_vls-9.c: New test.

---
 gcc/config/riscv/riscv-vector-costs.cc| 134 +-
 gcc/config/riscv/riscv-vector-costs.h |  43 ++
 

Re: [PATCH] strub: add note on attribute access

2023-12-12 Thread Jan Hubicka
> On Dec  7, 2023, Alexandre Oliva  wrote:
> 
> > Thanks for raising the issue.  Maybe there should be at least a comment
> > there, and perhaps some asserts to check that pointer and reference
> > types don't make to indirect_parms.
> 
> Document why attribute access doesn't need the same treatment as fn
> spec, and check that the assumption behind it holds.
> 
> Regstrapped on x86_64-linux-gnu.  Ok to install?
> 
> 
> for  gcc/ChangeLog
> 
>   * ipa-strub.cc (pass_ipa_strub::execute): Check that we don't
>   add indirection to pointer parameters, and document attribute
>   access non-interactions.
OK,
Honza


Re: [PATCH] ipa/92606 - properly handle no_icf attribute for variables

2023-12-12 Thread Jan Hubicka
> The following adds no_icf handling for variables where the attribute
> was rejected.  It also fixes the check for no_icf by checking both
> the source and the targets decl.
> 
> Bootstrap / regtest running on x86_64-unknown-linux-gnu.
> 
> This would solve the AVR issue with merging of "progmem" attributed
> and non-"progmem" attributed variables if they'd also add no_icf there.
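
For illustration, the variable form then works like the existing
function attribute (hypothetical example; with -fipa-icf-variables
these identical constants could otherwise be merged):

  static const int lut_a[4] __attribute__ ((no_icf)) = { 1, 2, 3, 4 };
  static const int lut_b[4] __attribute__ ((no_icf)) = { 1, 2, 3, 4 };
  /* no_icf keeps the two objects distinct, so &lut_a != &lut_b.  */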
> 
> OK?
> 
> Thanks,
> Richard.
> 
>   PR ipa/92606
> gcc/c-family/
>   * c-attribs.cc (handle_noicf_attribute): Also allow the
>   attribute on global variables.
> 
> gcc/
>   * ipa-icf.cc (sem_item_optimizer::merge_classes): Check
>   both source and alias for the no_icf attribute.
>   * doc/extend.texi (no_icf): Document variable attribute.
OK,
thanks!
Honza


[PATCH] tree-optimization/112961 - include latch in if-conversion CSE

2023-12-12 Thread Richard Biener
The following makes sure to also process the (empty) latch when
performing CSE on the if-converted loop body.  That's important
to get all uses of copies propagated out on the backedge as well.
To avoid CSE on the PHI nodes themselves, which is prohibitive
(see PR90402), this temporarily adds a fake entry edge to the loop.

Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.

PR tree-optimization/112961
* tree-if-conv.cc (tree_if_conversion): Instead of excluding
the latch block from VN, add a fake entry edge.

* g++.dg/vect/pr112961.cc: New testcase.
---
 gcc/testsuite/g++.dg/vect/pr112961.cc | 17 +
 gcc/tree-if-conv.cc   |  9 +++--
 2 files changed, 24 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/g++.dg/vect/pr112961.cc

diff --git a/gcc/testsuite/g++.dg/vect/pr112961.cc 
b/gcc/testsuite/g++.dg/vect/pr112961.cc
new file mode 100644
index 000..52759e180fb
--- /dev/null
+++ b/gcc/testsuite/g++.dg/vect/pr112961.cc
@@ -0,0 +1,17 @@
+// { dg-do compile }
+// { dg-require-effective-target vect_int }
+
+inline const int& maxx (const int& a, const int& b)
+{
+  return a > b ? a : b;
+}
+
+int foo(int *a)
+{
+  int max = 0;
+  for (int i = 0; i < 1024; ++i)
+max = maxx(max, a[i]);
+  return max;
+}
+
+// { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" { xfail 
vect_no_int_min_max } } }
diff --git a/gcc/tree-if-conv.cc b/gcc/tree-if-conv.cc
index 0bde281c246..f9fd0149937 100644
--- a/gcc/tree-if-conv.cc
+++ b/gcc/tree-if-conv.cc
@@ -3734,7 +3734,7 @@ tree_if_conversion (class loop *loop, vec 
*preds)
   auto_vec  reads_to_lower;
   auto_vec  writes_to_lower;
   bitmap exit_bbs;
-  edge pe;
+  edge pe, e;
   auto_vec refs;
   bool loop_versioned;
 
@@ -3894,11 +3894,13 @@ tree_if_conversion (class loop *loop, vec 
*preds)
   /* Perform local CSE, this esp. helps the vectorizer analysis if loads
  and stores are involved.  CSE only the loop body, not the entry
  PHIs, those are to be kept in sync with the non-if-converted copy.
+ Do this by adding a fake entry edge - we do want to include the
+ latch as otherwise copies on a reduction path cannot be propagated out.
  ???  We'll still keep dead stores though.  */
+  e = make_edge (ENTRY_BLOCK_PTR_FOR_FN (cfun), loop->header, EDGE_FAKE);
   exit_bbs = BITMAP_ALLOC (NULL);
   for (edge exit : get_loop_exit_edges (loop))
 bitmap_set_bit (exit_bbs, exit->dest->index);
-  bitmap_set_bit (exit_bbs, loop->latch->index);
 
   std::pair  *name_pair;
   unsigned ssa_names_idx;
@@ -3908,6 +3910,9 @@ tree_if_conversion (class loop *loop, vec 
*preds)
 
   todo |= do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);
 
+  /* Remove the fake edge again.  */
+  remove_edge (e);
+
   /* Delete dead predicate computations.  */
   ifcvt_local_dce (loop);
   BITMAP_FREE (exit_bbs);
-- 
2.35.3


[PATCH DejaGNU 1/1] Support per-test execution timeout factor

2023-12-12 Thread Maciej W. Rozycki
Add support for the `test_timeout_factor' global variable letting a test 
case scale the wait timeout used for code execution.  This is useful for 
particularly slow test cases for which increasing the wait timeout 
globally would be excessive.
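
For instance (values illustrative), with:

  set test_timeout 300
  set test_timeout_factor 4

in effect, the execution wait timeout becomes 1200 seconds.  A harness
can also set the variable per test case, which is what the companion
GCC patch does via a new `dg-test-timeout-factor' keyword.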

* baseboards/qemu.exp (qemu_load): Handle `test_timeout_factor'.
* config/gdb-comm.exp (gdb_comm_load): Likewise.
* config/gdb_stub.exp (gdb_stub_load): Likewise.
* config/sim.exp (sim_load): Likewise.
* config/unix.exp (unix_load): Likewise.
* doc/dejagnu.texi (Local configuration file): Document 
`test_timeout_factor'.
---
 baseboards/qemu.exp |4 
 config/gdb-comm.exp |4 
 config/gdb_stub.exp |4 
 config/sim.exp  |4 
 config/unix.exp |4 
 doc/dejagnu.texi|   10 +-
 6 files changed, 29 insertions(+), 1 deletion(-)

dejagnu-test-timeout-factor.diff
Index: dejagnu/baseboards/qemu.exp
===
--- dejagnu.orig/baseboards/qemu.exp
+++ dejagnu/baseboards/qemu.exp
@@ -200,11 +200,15 @@ proc qemu_load { dest prog args } {
 global qemu
 global timeout
 global test_timeout
+global test_timeout_factor
 
 set wait_timeout $timeout
 if {[info exists test_timeout]} {
set wait_timeout $test_timeout
 }
+if {[info exists test_timeout_factor]} {
+   set wait_timeout [expr $wait_timeout * $test_timeout_factor]
+}
 
 verbose -log "Executing on $dest: $prog (timeout = $wait_timeout)" 2
 
Index: dejagnu/config/gdb-comm.exp
===
--- dejagnu.orig/config/gdb-comm.exp
+++ dejagnu/config/gdb-comm.exp
@@ -254,6 +254,7 @@ proc gdb_comm_load { dest prog args } {
 global GDBFLAGS
 global gdb_prompt
 global test_timeout
+global test_timeout_factor
 set argnames { "command-line arguments" "input file" "output file" }
 
 for { set x 0 } { $x < [llength $args] } { incr x } {
@@ -274,6 +275,9 @@ proc gdb_comm_load { dest prog args } {
 } else {
set testcase_timeout 300
 }
+if {[info exists test_timeout_factor]} {
+   set testcase_timeout [expr $testcase_timeout * $test_timeout_factor]
+}
 
 verbose -log "Executing on $dest: $prog (timeout = $testcase_timeout)" 2
 
Index: dejagnu/config/gdb_stub.exp
===
--- dejagnu.orig/config/gdb_stub.exp
+++ dejagnu/config/gdb_stub.exp
@@ -471,6 +471,7 @@ proc gdb_stub_wait { dest timeout } {
 }
 
 proc gdb_stub_load { dest prog args } {
+global test_timeout_factor
 global test_timeout
 global gdb_prompt
 set argnames { "command-line arguments" "input file" "output file" }
@@ -485,6 +486,9 @@ proc gdb_stub_load { dest prog args } {
 if {[info exists test_timeout]} {
set wait_timeout $test_timeout
 }
+if {[info exists test_timeout_factor]} {
+   set wait_timeout [expr $wait_timeout * $test_timeout_factor]
+}
 
 verbose -log "Executing on $dest: $prog (timeout = $wait_timeout)" 2
 
Index: dejagnu/config/sim.exp
===
--- dejagnu.orig/config/sim.exp
+++ dejagnu/config/sim.exp
@@ -60,6 +60,7 @@ proc sim_wait { dest timeout } {
 }
 
 proc sim_load { dest prog args } {
+global test_timeout_factor
 global test_timeout
 
 set inpfile ""
@@ -82,6 +83,9 @@ proc sim_load { dest prog args } {
 } else {
set sim_time_limit 240
 }
+if {[info exists test_timeout_factor]} {
+   set sim_time_limit [expr $sim_time_limit * $test_timeout_factor]
+}
 
 set output ""
 
Index: dejagnu/config/unix.exp
===
--- dejagnu.orig/config/unix.exp
+++ dejagnu/config/unix.exp
@@ -33,6 +33,7 @@ load_lib remote.exp
 
 
 proc unix_load { dest prog args } {
+global test_timeout_factor
 global ld_library_path
 global test_timeout
 set output ""
@@ -42,6 +43,9 @@ proc unix_load { dest prog args } {
 if {[info exists test_timeout]} {
set wait_timeout $test_timeout
 }
+if {[info exists test_timeout_factor]} {
+   set wait_timeout [expr $wait_timeout * $test_timeout_factor]
+}
 
 if { [llength $args] > 0 } {
set parg [lindex $args 0]
Index: dejagnu/doc/dejagnu.texi
===
--- dejagnu.orig/doc/dejagnu.texi
+++ dejagnu/doc/dejagnu.texi
@@ -1363,11 +1363,19 @@ by DejaGnu itself for cross testing, but
 to manipulate these itself.
 
 @vindex test_timeout
+@vindex test_timeout_factor
 The local @file{site.exp} may also set Tcl variables such as
 @code{test_timeout} which can control the amount of time (in seconds)
 to wait for a remote test to complete.  If not specified,
 @code{test_timeout} defaults to 120 or 300 seconds, depending on the
-communication protocol.
+communication 

[PATCH GCC 1/1] testsuite: Support test execution timeout factor as a keyword

2023-12-12 Thread Maciej W. Rozycki
Add support for the `dg-test-timeout-factor' keyword letting a test
case scale the wait timeout used for code execution, analogously to
`dg-timeout-factor' used for code compilation.  This is useful for
particularly slow test cases for which increasing the wait timeout
globally would be excessive.

gcc/testsuite/
* lib/timeout-dg.exp (dg-test-timeout-factor): New procedure.
---
 gcc/testsuite/lib/timeout-dg.exp |   17 +
 1 file changed, 17 insertions(+)

gcc-test-test-timeout-factor.diff
Index: gcc/gcc/testsuite/lib/timeout-dg.exp
===
--- gcc.orig/gcc/testsuite/lib/timeout-dg.exp
+++ gcc/gcc/testsuite/lib/timeout-dg.exp
@@ -47,3 +47,20 @@ proc dg-timeout-factor { args } {
set timeout_factor [lindex $args 0]
 }
 }
+
+#
+# dg-test-timeout-factor -- Scale the test execution timeout limit
+#
+
+proc dg-test-timeout-factor { args } {
+global test_timeout_factor
+
+set args [lreplace $args 0 0]
+if { [llength $args] > 1 } {
+   if { [dg-process-target [lindex $args 1]] == "S" } {
+   set test_timeout_factor [lindex $args 0]
+   }
+} else {
+   set test_timeout_factor [lindex $args 0]
+}
+}


[PATCH DejaGNU/GCC 0/1] Support per-test execution timeout factor

2023-12-12 Thread Maciej W. Rozycki
Hi,

 This patch quasi-series makes it possible for individual test cases 
identified as being slow to request more time via the GCC test harness by 
providing a test execution timeout factor, applied to the test execution 
timeout set globally for all the test cases.  This is to avoid excessive 
testsuite run times where other test cases do hang, as would be the 
case if the timeout set globally were increased.

 The test execution timeout is different from the tool execution timeout: 
the latter guards against GCC itself taking an excessive amount of time 
on the test host, rather than against the run of the resulting test case 
executable on the target afterwards, which is what is concerned here.  
GCC already has a `dg-timeout-factor' setting for the tool execution 
timeout, but no means to increase the test execution timeout.  The GCC 
side of these changes adds a corresponding `dg-test-timeout-factor' 
setting.

 As the two changes are independent from each other, they can be applied 
in any order with the feature becoming active once both have been placed 
in a given system.  I chose to submit them together so as to give an 
opportunity to both DejaGNU and GCC developers to chime in.

 The DejaGNU side of this patch quasi-series relies on that patch series: 
 to be 
applied first, however I chose to post the two parts separately so as not 
to clutter the GCC mailing list with changes solely for DejaGNU.

 This has been verified with the GCC testsuite in a couple of environments 
using the Unix protocol, both locally and remotely, the GDB stub protocol, 
and the sim protocol, making sure that timeout settings are respected.  I 
found no obvious way to verify the remaining parts, but the changes follow 
the same pattern, so they're expected to behave consistently.

 Let me know if you have any questions, comments or concerns.  Otherwise 
please apply/approve respectively the DejaGNU/GCC side.

  Maciej


Re: [PATCH v2] RISC-V: Supports RISC-V Profiles in '-march' option.

2023-12-12 Thread Christoph Müllner
On Tue, Dec 12, 2023 at 1:08 PM Jiawei  wrote:
>
> Support RISC-V profiles[1] in the -march option.
>
> By default the input profile is expected before other formal extensions.
>
> V2: Fixes some format errors and adds code comments for the parse function.
> Thanks to Jeff Law for his review and comments.
>
> [1]https://github.com/riscv/riscv-profiles/blob/main/profiles.adoc
>
> gcc/ChangeLog:
>
> * common/config/riscv/riscv-common.cc (struct riscv_profiles):
>   New struct.
> (riscv_subset_list::parse_profiles): New function.
> (riscv_subset_list::parse): New table.
> * config/riscv/riscv-subset.h: New prototype.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/riscv/arch-31.c: New test.
> * gcc.target/riscv/arch-32.c: New test.
> * gcc.target/riscv/arch-33.c: New test.
> * gcc.target/riscv/arch-34.c: New test.

For the positive tests (-31.c and -33.c) it would be great to test if
the enabled extensions' test macros are set.
Something like this would do:
#if (!(defined __riscv_zicsr) || \
  !(defined __riscv_...))
#error "Feature macros not defined"
#endif

Also, positive tests for RVI20U32 and RVI20U64 would be nice.
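
A sketch of such a test (assuming the RVI20U64 expansion to rv64i from
the table in the patch; the macro follows the RISC-V C API convention):

  /* { dg-do compile } */
  /* { dg-options "-march=RVI20U64 -mabi=lp64" } */
  #if !defined (__riscv_i)
  #error "Feature macros not defined"
  #endif
  int dummy;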

>
> ---
>  gcc/common/config/riscv/riscv-common.cc  | 83 +++-
>  gcc/config/riscv/riscv-subset.h  |  2 +
>  gcc/testsuite/gcc.target/riscv/arch-31.c |  5 ++
>  gcc/testsuite/gcc.target/riscv/arch-32.c |  5 ++
>  gcc/testsuite/gcc.target/riscv/arch-33.c |  5 ++
>  gcc/testsuite/gcc.target/riscv/arch-34.c |  7 ++
>  6 files changed, 106 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/riscv/arch-31.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/arch-32.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/arch-33.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/arch-34.c
>
> diff --git a/gcc/common/config/riscv/riscv-common.cc 
> b/gcc/common/config/riscv/riscv-common.cc
> index 4d5a2f874a2..8b674a4a280 100644
> --- a/gcc/common/config/riscv/riscv-common.cc
> +++ b/gcc/common/config/riscv/riscv-common.cc
> @@ -195,6 +195,12 @@ struct riscv_ext_version
>int minor_version;
>  };
>
> +struct riscv_profiles
> +{
> +  const char *profile_name;
> +  const char *profile_string;
> +};
> +
>  /* All standard extensions defined in all supported ISA spec.  */
>  static const struct riscv_ext_version riscv_ext_version_table[] =
>  {
> @@ -379,6 +385,42 @@ static const struct riscv_ext_version 
> riscv_combine_info[] =
>{NULL, ISA_SPEC_CLASS_NONE, 0, 0}
>  };
>
> +/* This table records the mapping from RISC-V Profiles into march strings.  */
> +static const riscv_profiles riscv_profiles_table[] =
> +{
> +  /* RVI20U only contains the base extension 'i' as the mandatory extension.  */
> +  {"RVI20U64", "rv64i"},
> +  {"RVI20U32", "rv32i"},
> +
> +  /* RVA20U contains the 'i,m,a,f,d,c,zicsr' as mandatory extensions.
> + Currently we don't have zicntr,ziccif,ziccrse,ziccamoa,
> + zicclsm,za128rs yet.   */
> +  {"RVA20U64", "rv64imafdc_zicsr"},
> +
> +  /* RVA20S64 mandatorily includes all the extensions in RVA20U64 and
> + additional 'zifencei' as mandatory extensions.
> + Note that ss1p11, svbare, sv39, svade, sscptr, ssvecd, sstvala should
> + be controlled by binutils.  */
> +  {"RVA20S64", "rv64imafdc_zicsr_zifencei"},
> +
> +  /* RVA22U contains the 'i,m,a,f,d,c,zicsr,zihintpause,zba,zbb,zbs,
> + zicbom,zicbop,zicboz,zfhmin,zkt' as mandatory extensions.
> + Currently we don't have zicntr,zihpm,ziccif,ziccrse,ziccamoa,
> + zicclsm,zic64b,za64rs yet.  */

I would prefer that we implement the missing extensions that start
with 'z' as "dummy" extensions.  I.e., they (currently?) don't affect
code generation, but they will be passed on to the assembler and will
become part of the Tag_RISCV_arch string.

I admit that such "dummy" extensions may not be preferred by
maintainers, but we already have a precedent with Zkt.

I consider an incomplete expansion of a profile misleading.
And later changes to complete the expansion could be called out as
"breaking changes".

> +  {"RVA22U64", "rv64imafdc_zicsr_zihintpause_zba_zbb_zbs"
>   \
> +   "_zicbom_zicbop_zicboz_zfhmin_zkt"},
> +
> +  /* RVA22S64 mandatorily includes all the extensions in RVA22U64 and
> + additional 'zifencei,svpbmt,svinval' as mandatory extensions.
> + Notes that ss1p12, svbare, sv39, svade, sscptr, ssvecd, sstvala,
> + scounterenw extentions should control by binutils.  */

Typo: extentions -> extensions

I want to challenge the implementation of RVA22S64 support
(or in general all S-mode and M-mode profile support) in toolchains:
* Adding 's*'/'m*' extensions as dummy extensions won't have much use
* Having an incomplete extension is misleading (see above)
* I doubt that RVA22S64 would find many users
Therefore, I would not add support for S-mode and M-mode profiles.

> +  {"RVA22S64","rv64imafdc_zicsr_zifencei_zihintpause"
>   \
> +   

Re: [PATCH] RISC-V: Refactor Dynamic LMUL codes

2023-12-12 Thread Robin Dapp
Yes, no harm in doing that.  LGTM.

Regards
 Robin


Re: [PATCH] tree-optimization/112736 - avoid overread with non-grouped SLP load

2023-12-12 Thread Richard Biener
On Tue, 12 Dec 2023, Richard Sandiford wrote:

> Richard Biener  writes:
> > The following avoids over/under-read of storage when vectorizing
> > a non-grouped load with SLP.  Instead of forcing peeling for gaps
> > use a smaller load for the last vector which might access excess
> > elements.  This builds upon the existing optimization avoiding
> > peeling for gaps, generalizing it to all gap widths leaving a
> > power-of-two remaining number of elements (but it doesn't replace
> > or improve that particular case at this point).
> >
> > I wonder if the poly relational compares I set up are good enough
> > to guarantee /* remain should now be > 0 and < nunits.  */.
> >
> > There is existing test coverage that runs into /* DR will be unused.  */
> > always when the gap is wider than nunits.  Compared to the
> > existing gap == nunits/2 case this only adjusts the load that will
> > cause the overrun at the end, not every load.  Apart from the
> > poly relational compares it should reliably cover these cases but
> > I'll leave it for stage1 to remove.
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu, I've also
> > built and tested SPEC CPU 2017.
> >
> > OK?
> >
> > PR tree-optimization/112736
> > * tree-vect-stmts.cc (vectorizable_load): Extend optimization
> > to avoid peeling for gaps to handle single-element non-groups
> > we now allow with SLP.
> >
> > * gcc.dg/torture/pr112736.c: New testcase.
> 
> Mostly LGTM FWIW.  A couple of comments below:
> 
> > ---
> >  gcc/testsuite/gcc.dg/torture/pr112736.c | 27 
> >  gcc/tree-vect-stmts.cc  | 86 -
> >  2 files changed, 96 insertions(+), 17 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.dg/torture/pr112736.c
> >
> > diff --git a/gcc/testsuite/gcc.dg/torture/pr112736.c 
> > b/gcc/testsuite/gcc.dg/torture/pr112736.c
> > new file mode 100644
> > index 000..6abb56edba3
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/torture/pr112736.c
> > @@ -0,0 +1,27 @@
> > +/* { dg-do run { target *-*-linux* *-*-gnu* *-*-uclinux* } } */
> > +
> > +#include 
> > +#include 
> > +
> > +int a, c[3][5];
> > +
> > +void __attribute__((noipa))
> > +fn1 (int * __restrict b)
> > +{
> > +  int e;
> > +  for (a = 2; a >= 0; a--)
> > +for (e = 0; e < 4; e++)
> > +  c[a][e] = b[a];
> > +}
> > +
> > +int main()
> > +{
> > +  long pgsz = sysconf (_SC_PAGESIZE);
> > +  void *p = mmap (NULL, pgsz * 2, PROT_READ|PROT_WRITE,
> > +  MAP_ANONYMOUS|MAP_PRIVATE, 0, 0);
> > +  if (p == MAP_FAILED)
> > +return 0;
> > +  mprotect (p, pgsz, PROT_NONE);
> > +  fn1 (p + pgsz);
> > +  return 0;
> > +}
> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > index 390c8472fd6..c03c4c08c9d 100644
> > --- a/gcc/tree-vect-stmts.cc
> > +++ b/gcc/tree-vect-stmts.cc
> > @@ -11465,26 +11465,70 @@ vectorizable_load (vec_info *vinfo,
> > if (new_vtype != NULL_TREE)
> >   ltype = half_vtype;
> >   }
> > +   /* Try to use a single smaller load when we are about
> > +  to load accesses excess elements compared to the
> 
> s/accesses //
> 
> > +  unrolled scalar loop.
> > +  ???  This should cover the above case as well.  */
> > +   else if (known_gt ((vec_num * j + i + 1) * nunits,
> > +  (group_size * vf - gap)))
> 
> At first it seemed odd to be using known_gt rather than maybe_gt here,
> given that peeling for gaps is a correctness issue.  But as things stand
> this is just an optimisation, and VLA vectors (to whatever extent they're
> handled by this code) allegedly work correctly without it.  So I agree
> known_gt is correct.  We might need to revisit it when dealing with the
> ??? though.

Yeah, maybe_gt would be needed if this were about correctness, but then ...

> > + {
> > +   if (known_ge ((vec_num * j + i + 1) * nunits
> > + - (group_size * vf - gap), nunits))
> > + /* DR will be unused.  */
> > + ltype = NULL_TREE;
> > +   else if (alignment_support_scheme == dr_aligned)
> > + /* Aligned access to excess elements is OK if
> > +at least one element is accessed in the
> > +scalar loop.  */
> > + ;
> > +   else
> > + {
> > +   auto remain
> > + = ((group_size * vf - gap)
> > +- (vec_num * j + i) * nunits);
> > +   /* remain should now be > 0 and < nunits.  */

... we probably don't know it's < nunits anymore.  Indeed let's revisit
this at some later point.

> > +   unsigned num;
> > +   if (constant_multiple_p (nunits, remain, ))
> > + {
> > 

Re: [PATCH] tree-optimization/112736 - avoid overread with non-grouped SLP load

2023-12-12 Thread Richard Sandiford
Richard Biener  writes:
> The following avoids over/under-read of storage when vectorizing
> a non-grouped load with SLP.  Instead of forcing peeling for gaps
> use a smaller load for the last vector which might access excess
> elements.  This builds upon the existing optimization avoiding
> peeling for gaps, generalizing it to all gap widths leaving a
> power-of-two remaining number of elements (but it doesn't replace
> or improve that particular case at this point).
>
> I wonder if the poly relational compares I set up are good enough
> to guarantee /* remain should now be > 0 and < nunits.  */.
>
> There is existing test coverage that runs into /* DR will be unused.  */
> always when the gap is wider than nunits.  Compared to the
> existing gap == nunits/2 case this only adjusts the load that will
> cause the overrun at the end, not every load.  Apart from the
> poly relational compares it should reliably cover these cases but
> I'll leave it for stage1 to remove.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu, I've also
> built and tested SPEC CPU 2017.
>
> OK?
>
>   PR tree-optimization/112736
>   * tree-vect-stmts.cc (vectorizable_load): Extend optimization
>   to avoid peeling for gaps to handle single-element non-groups
>   we now allow with SLP.
>
>   * gcc.dg/torture/pr112736.c: New testcase.

Mostly LGTM FWIW.  A couple of comments below:

> ---
>  gcc/testsuite/gcc.dg/torture/pr112736.c | 27 
>  gcc/tree-vect-stmts.cc  | 86 -
>  2 files changed, 96 insertions(+), 17 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/torture/pr112736.c
>
> diff --git a/gcc/testsuite/gcc.dg/torture/pr112736.c 
> b/gcc/testsuite/gcc.dg/torture/pr112736.c
> new file mode 100644
> index 000..6abb56edba3
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/torture/pr112736.c
> @@ -0,0 +1,27 @@
> +/* { dg-do run { target *-*-linux* *-*-gnu* *-*-uclinux* } } */
> +
> +#include 
> +#include 
> +
> +int a, c[3][5];
> +
> +void __attribute__((noipa))
> +fn1 (int * __restrict b)
> +{
> +  int e;
> +  for (a = 2; a >= 0; a--)
> +for (e = 0; e < 4; e++)
> +  c[a][e] = b[a];
> +}
> +
> +int main()
> +{
> +  long pgsz = sysconf (_SC_PAGESIZE);
> +  void *p = mmap (NULL, pgsz * 2, PROT_READ|PROT_WRITE,
> +  MAP_ANONYMOUS|MAP_PRIVATE, 0, 0);
> +  if (p == MAP_FAILED)
> +return 0;
> +  mprotect (p, pgsz, PROT_NONE);
> +  fn1 (p + pgsz);
> +  return 0;
> +}
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 390c8472fd6..c03c4c08c9d 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -11465,26 +11465,70 @@ vectorizable_load (vec_info *vinfo,
>   if (new_vtype != NULL_TREE)
> ltype = half_vtype;
> }
> + /* Try to use a single smaller load when we are about
> +to load accesses excess elements compared to the

s/accesses //

> +unrolled scalar loop.
> +???  This should cover the above case as well.  */
> + else if (known_gt ((vec_num * j + i + 1) * nunits,
> +(group_size * vf - gap)))

At first it seemed odd to be using known_gt rather than maybe_gt here,
given that peeling for gaps is a correctness issue.  But as things stand
this is just an optimisation, and VLA vectors (to whatever extent they're
handled by this code) allegedly work correctly without it.  So I agree
known_gt is correct.  We might need to revisit it when dealing with the
??? though.
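
(For readers less familiar with poly_int: with an SVE-style length such
as 4 + 4x, x being the runtime parameter, the two predicates differ as
sketched below, per GCC's poly-int.h -- this is not code from the patch.)

  /* known_gt (a, b): a > b for every runtime x -- safe to key an
     optimisation on.
     maybe_gt (a, b): a > b for some runtime x -- what a correctness
     check would have to use.  */
  poly_uint64 nunits (4, 4);              /* 4 + 4x elements */
  if (known_gt (nunits * 2, nunits + 3))  /* 8+8x > 7+4x for all x */
    /* ... */;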

> +   {
> + if (known_ge ((vec_num * j + i + 1) * nunits
> +   - (group_size * vf - gap), nunits))
> +   /* DR will be unused.  */
> +   ltype = NULL_TREE;
> + else if (alignment_support_scheme == dr_aligned)
> +   /* Aligned access to excess elements is OK if
> +  at least one element is accessed in the
> +  scalar loop.  */
> +   ;
> + else
> +   {
> + auto remain
> +   = ((group_size * vf - gap)
> +  - (vec_num * j + i) * nunits);
> + /* remain should now be > 0 and < nunits.  */
> + unsigned num;
> + if (constant_multiple_p (nunits, remain, ))
> +   {
> + tree ptype;
> + new_vtype
> +   = vector_vector_composition_type (vectype,
> + num,
> + );
> + if (new_vtype)
> +

Re: [PATCH v2] LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT.

2023-12-12 Thread Xi Ruoyao
On Tue, 2023-12-12 at 19:59 +0800, Jiahao Xu wrote:
> > I guess here the problem is floating-point compare instruction is much
> > more costly than other instructions but the fact is not correctly
> > modeled yet.  Could you try
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640012.html
> > where I've raised fp_add cost (which is used for estimating floating-
> > point compare cost) to 5 instructions and see if it solves your problem
> > without LOGICAL_OP_NON_SHORT_CIRCUIT?
> I think this is not the same issue as the cost of floating-point 
> comparison instructions. The definition of LOGICAL_OP_NON_SHORT_CIRCUIT 
> affects how the short-circuit branch, such as (A AND-IF B), is executed, 
> and it is not directly related to the cost of floating-point comparison 
> instructions. I will try to test it using SPECCPU 2017.

The point is that if the cost of a floating-point comparison is very
high, the middle end *should* short-circuit floating-point comparisons
even if LOGICAL_OP_NON_SHORT_CIRCUIT = 1.
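
To illustrate what the macro controls (a made-up example, not from the
patch):

  int
  f (float a, float b, float c, float d)
  {
    /* LOGICAL_OP_NON_SHORT_CIRCUIT = 1: the middle end may evaluate
       both compares and branch once, roughly
         t1 = a < b; t2 = c < d; if (t1 | t2) ...
       LOGICAL_OP_NON_SHORT_CIRCUIT = 0: keep two branches,
         if (a < b) goto yes; if (c < d) goto yes;  */
    if (a < b || c < d)
      return 0;
    return 1;
  }

The branchless form trades a branch for an unconditionally executed
second compare, which only pays off when compares are cheap.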

I've created https://gcc.gnu.org/PR112985.

Another factor regressing the code is that we haven't modeled the
movcf2gr instruction yet, so we are not really eliding the branches as
LOGICAL_OP_NON_SHORT_CIRCUIT = 1 is supposed to do.

-- 
Xi Ruoyao 
School of Aerospace Science and Technology, Xidian University


Re: [RFC] Intel AVX10.1 Compiler Design and Support

2023-12-12 Thread Richard Biener
On Tue, Dec 12, 2023 at 10:05 AM Florian Weimer  wrote:
>
> * Richard Biener:
>
> > If it were possible I'd axe x86_64-v4.  Maybe we should add a x86_64-v3.5
> > that sits inbetween v3 and v4, offering AVX512 but restricted to 256bit
> > (and obviously not requiring more of the AVX512 features that v4
> > requires).
>
> As far as I understand it, GCC's Intel tuning for AVX-512 is leaning
> heavily towards 256 bit vector length anyway.

Indeed it does, enabling avx256_optimal everywhere, but enabling
512bit moves on Sapphire Rapids (and not Granite Rapids!?).

>  That's not true for the
> default tuning for -march=x86-64-v4, though, it prefers 512 bit vectors.
> I've seen third-party reports that AMD Zen 4 does better in some ways
> with 512 bit vectors than with 256 bit vectors (despite its 256-bit-wide
> execution ports), but I have not tried to verify these observations.
> Still, this suggests that restricting a post-x86-64-v3 level to 256 bit
> vectors may not be an easy decision.

The corner Intel painted itself into is that their small cores on the
hybrid consumer products only support 128bit native (256bit emulated)
and their data-center "small core" SKU doesn't fare any better there.
That's the reason their marketing invented AVX10, which will allow
the AVX512 ISA to play "nice" with a smaller data path (but I'm sure, or
at least I hope, that actual implementations will have a native 256bit
data path and not emulate it via 128bit).

The current problem is that the castrated consumer SKUs cannot use
EVEX at all and in the future will be crippled to 256bits.  So that will
be the common thing to target when targeting EVEX support across
Intel/AMD - use 256bit only.  Note that AVX10 excludes Zen4 which
lacks support for two niche AVX512 ISA extensions.

> On the other hand, a new EVEX-capable level might bring earlier adoption
> of EVEX capabilities to AMD CPUs, which still should be an improvement
> over AVX2.  This could benefit AMD as well.  So I would really like to
> see some AMD feedback here.
>
> There's also the matter that time scales for EVEX adoption are so long
> that by then, Intel CPUs may end up supporting and preferring 512 bit
> vectors again.

True, there isn't even widespread VEX adoption yet ... and now there's
APX as the next best thing to target.

That said, my main point was that x86-64-v4 is "broken" as it appears
as a dead end - AVX512 is no more, the future is AVX10, but yet we have
to define x86-64-v5 as something that includes x86-64-v4.

So, can we un-do x86-64-v4?

Richard.

> Thanks,
> Florian
>


Re: Re: [RFC] RISC-V: Support RISC-V Profiles in -march option.

2023-12-12 Thread jiawei
 -Original Message-
 From: "Jeff Law" 
 Sent: 2023-12-12 00:15:44 (Tuesday)
 To: Jiawei , gcc-patches@gcc.gnu.org
 Cc: kito.ch...@sifive.com, pal...@dabbelt.com, christoph.muell...@vrull.eu
 Subject: Re: [RFC] RISC-V: Support RISC-V Profiles in -march option.
 
 
 
 On 11/20/23 12:14, Jiawei wrote:
  Supports RISC-V profiles[1] in -march option.
  
  Default input set the profile is before other formal extensions.
  
  [1]https://github.com/riscv/riscv-profiles/blob/main/profiles.adoc
  
  gcc/ChangeLog:
  
   * common/config/riscv/riscv-common.cc (struct 
riscv_profiles):
 New struct.
   (riscv_subset_list::parse_profiles): New function.
   (riscv_subset_list::parse): New table.
   * config/riscv/riscv-subset.h: New protype.
  
  gcc/testsuite/ChangeLog:
  
   * gcc.target/riscv/arch-29.c: New test.
   * gcc.target/riscv/arch-30.c: New test.
   * gcc.target/riscv/arch-31.c: New test.
  
  ---
gcc/common/config/riscv/riscv-common.cc  | 58 
+++-
gcc/config/riscv/riscv-subset.h  |  2 +
gcc/testsuite/gcc.target/riscv/arch-29.c |  5 ++
gcc/testsuite/gcc.target/riscv/arch-30.c |  5 ++
gcc/testsuite/gcc.target/riscv/arch-31.c |  5 ++
6 files changed, 81 insertions(+), 1 deletion(-)
create mode 100644 gcc/testsuite/gcc.target/riscv/arch-29.c
create mode 100644 gcc/testsuite/gcc.target/riscv/arch-30.c
create mode 100644 gcc/testsuite/gcc.target/riscv/arch-31.c
  
  diff --git a/gcc/common/config/riscv/riscv-common.cc 
b/gcc/common/config/riscv/riscv-common.cc
  index 5111626157b..30617e619b1 100644
  --- a/gcc/common/config/riscv/riscv-common.cc
  +++ b/gcc/common/config/riscv/riscv-common.cc
  @@ -165,6 +165,12 @@ struct riscv_ext_version
  int minor_version;
};

  +struct riscv_profiles
  +{
  +  const char * profile_name;
  +  const char * profile_string;
  +};
 Just a formatting nit, no space between the '*' and the field name.

Fixed.

 
  @@ -348,6 +354,28 @@ static const struct riscv_ext_version 
riscv_combine_info[] =
  {NULL, ISA_SPEC_CLASS_NONE, 0, 0}
};

  +static const riscv_profiles riscv_profiles_table[] =
  +{
  +  {"RVI20U64", "rv64i"},
  +  {"RVI20U32", "rv32i"},
  +  /*Currently we don't have zicntr,ziccif,ziccrse,ziccamoa,
  +zicclsm,za128rs yet.  */
 Is it actually useful to note the extensions not included?  I don't 
 think the profiles are supposed to change once ratified.
 
  +  {"RVA22U64", "rv64imafdc_zicsr_zihintpause_zba_zbb_zbs_"   
\
 Note the trailing "_", was that intentional?  None of the other entries 
 have a trailing "_".

There is a line break here because the arch string was too long;
I've adjusted the formatting in the new patch.

 
 
  @@ -927,6 +955,31 @@ riscv_subset_list::parsing_subset_version (const 
char *ext,
  return p;
}

  +const char *
  +riscv_subset_list::parse_profiles (const char * p){
  +  for (int i = 0; riscv_profiles_table[i].profile_name != NULL; ++i) 
{
  +const char* match = strstr(p, 
riscv_profiles_table[i].profile_name);
  +const char* plus_ext = strchr(p, '+');
  +/* Find profile at the begin.  */
   +if (match != NULL && match == p) {
  +  /* If there's no '+' sign, return the profile_string directly. 
 */
  +  if(!plus_ext)
  + return riscv_profiles_table[i].profile_string;
  +  /* If there's a '+' sign, concatenate profiles with other ext. 
 */
  +  else {
  + size_t arch_len = 
strlen(riscv_profiles_table[i].profile_string) +
  + strlen(plus_ext);
  + static char* result = new char[arch_len + 2];
  + strcpy(result, riscv_profiles_table[i].profile_string);
  + strcat(result, "_");
  + strcat(result, plus_ext + 1); /* skip the '+'.  */
  + return result;
  +  }
  +}
  +  }
  +  return p;
  +}
 This needs a function comment.

Thanks, I added a description for the parse function and some handling logic.

 
 The open curly should always be on a line by itself which is going to 
 require reindenting all this code.  Comments go on separate lines rather 
 than appending them to an existing line.
 
 
 I think the consensus in the Tuesday patchwork meeting was that while 
 there are concerns about profiles, those concerns shouldn't prevent this 
 patch from going forward.  So if you could fix the formatting problem as 
 well as the trailing "_" issue noted above and repost, it would be 
 appreciated.
 
 Thanks,
 
 Jeff

Thanks for your review and comments, I had update them in the new patch:

https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640324.html

BR,
Jiawei

[committed] testsuite: Fix up test directive syntax errors

2023-12-12 Thread Jakub Jelinek
Hi!

I've noticed
+ERROR: gcc.dg/gomp/pr87887-1.c: syntax error in target selector ".-4" for " 
dg-warning 13 "unsupported return type ‘struct S’ for ‘simd’ functions" { 
target aarch64*-*-* } .-4 "
+ERROR: gcc.dg/gomp/pr87887-1.c: syntax error in target selector ".-4" for " 
dg-warning 13 "unsupported return type ‘struct S’ for ‘simd’ functions" { 
target aarch64*-*-* } .-4 "
+ERROR: gcc.dg/gomp/pr89246-1.c: syntax error in target selector ".-4" for " 
dg-warning 11 "unsupported argument type ‘__int128’ for ‘simd’ functions" { 
target aarch64*-*-* } .-4 "
+ERROR: gcc.dg/gomp/pr89246-1.c: syntax error in target selector ".-4" for " 
dg-warning 11 "unsupported argument type ‘__int128’ for ‘simd’ functions" { 
target aarch64*-*-* } .-4 "
+ERROR: gcc.dg/gomp/simd-clones-2.c: unmatched open quote in list for " 
dg-final 19 { scan-tree-dump "_ZGVnN2ua32vl_setArray" "optimized { target 
aarch64*-*-* } } "
+ERROR: gcc.dg/gomp/simd-clones-2.c: unmatched open quote in list for " 
dg-final 19 { scan-tree-dump "_ZGVnN2ua32vl_setArray" "optimized { target 
aarch64*-*-* } } "
regressions.  The following patch fixes those.

Tested on x86_64-linux, committed to trunk.

2023-12-12  Jakub Jelinek  

* gcc.dg/gomp/pr87887-1.c: Add missing comment argument to dg-warning.
* gcc.dg/gomp/pr89246-1.c: Likewise.
* gcc.dg/gomp/simd-clones-2.c: Add missing " after dump name.
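
For reference, the dg-warning arguments are positional,

  { dg-warning regexp [comment [{ target/xfail selector } [line]]] }

so with the comment string omitted the target selector was consumed as
the comment and the ".-4" line offset parsed as a selector, producing
the errors above; likewise the missing close quote made the dump name
swallow the rest of the dg-final directive.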

--- gcc/testsuite/gcc.dg/gomp/pr87887-1.c.jj2023-12-11 23:52:03.761510740 
+0100
+++ gcc/testsuite/gcc.dg/gomp/pr87887-1.c   2023-12-12 13:02:14.831706007 
+0100
@@ -10,7 +10,7 @@ foo (int x)
 {
   return (struct S) { x };
 }
-/* { dg-warning "unsupported return type ‘struct S’ for ‘simd’ functions" { 
target aarch64*-*-* } .-4 } */
+/* { dg-warning "unsupported return type ‘struct S’ for ‘simd’ functions" "" { 
target aarch64*-*-* } .-4 } */
 
 #pragma omp declare simd
 int
@@ -18,7 +18,7 @@ bar (struct S x)
 {
   return x.n;
 }
-/* { dg-warning "unsupported argument type ‘struct S’ for ‘simd’ functions" { 
target aarch64*-*-* } .-4 } */
+/* { dg-warning "unsupported argument type ‘struct S’ for ‘simd’ functions" "" 
{ target aarch64*-*-* } .-4 } */
 
 #pragma omp declare simd uniform (x)
 int
--- gcc/testsuite/gcc.dg/gomp/pr89246-1.c.jj2023-12-11 23:52:03.768510643 
+0100
+++ gcc/testsuite/gcc.dg/gomp/pr89246-1.c   2023-12-12 13:02:37.375394079 
+0100
@@ -8,7 +8,7 @@ int foo (__int128 x)
 {
   return x;
 }
-/* { dg-warning "unsupported argument type ‘__int128’ for ‘simd’ functions" { 
target aarch64*-*-* } .-4 } */
+/* { dg-warning "unsupported argument type ‘__int128’ for ‘simd’ functions" "" 
{ target aarch64*-*-* } .-4 } */
 
 #pragma omp declare simd
 extern int bar (int x);
--- gcc/testsuite/gcc.dg/gomp/simd-clones-2.c.jj2023-12-11 
23:52:03.768510643 +0100
+++ gcc/testsuite/gcc.dg/gomp/simd-clones-2.c   2023-12-12 13:03:37.654560017 
+0100
@@ -16,12 +16,12 @@ float setArray(float *a, float x, int k)
 }
 /* { dg-final { scan-tree-dump {(?n)^__attribute__\(\(omp declare simd 
\(notinbranch uniform\(0\) aligned\(0:32\) linear\(2:1\)\)\)\)$} "optimized" } 
} */
 
-/* { dg-final { scan-tree-dump "_ZGVnN2ua32vl_setArray" "optimized { target 
aarch64*-*-* } } } */
-/* { dg-final { scan-tree-dump "_ZGVnN4ua32vl_setArray" "optimized { target 
aarch64*-*-* } } } */
-/* { dg-final { scan-tree-dump "_ZGVnN2vvva32_addit" "optimized { target 
aarch64*-*-* } } } */
-/* { dg-final { scan-tree-dump "_ZGVnN4vvva32_addit" "optimized { target 
aarch64*-*-* } } } */
-/* { dg-final { scan-tree-dump "_ZGVnM2vl66u_addit" "optimized { target 
aarch64*-*-* } } } */
-/* { dg-final { scan-tree-dump "_ZGVnM4vl66u_addit" "optimized { target 
aarch64*-*-* } } } */
+/* { dg-final { scan-tree-dump "_ZGVnN2ua32vl_setArray" "optimized" { target 
aarch64*-*-* } } } */
+/* { dg-final { scan-tree-dump "_ZGVnN4ua32vl_setArray" "optimized" { target 
aarch64*-*-* } } } */
+/* { dg-final { scan-tree-dump "_ZGVnN2vvva32_addit" "optimized" { target 
aarch64*-*-* } } } */
+/* { dg-final { scan-tree-dump "_ZGVnN4vvva32_addit" "optimized" { target 
aarch64*-*-* } } } */
+/* { dg-final { scan-tree-dump "_ZGVnM2vl66u_addit" "optimized" { target 
aarch64*-*-* } } } */
+/* { dg-final { scan-tree-dump "_ZGVnM4vl66u_addit" "optimized" { target 
aarch64*-*-* } } } */
 
 /* { dg-final { scan-tree-dump "_ZGVbN4ua32vl_setArray" "optimized" { target 
i?86-*-* x86_64-*-* } } } */
 /* { dg-final { scan-tree-dump "_ZGVbN4vvva32_addit" "optimized" { target 
i?86-*-* x86_64-*-* } } } */


Jakub



[PATCH v2] RISC-V: Supports RISC-V Profiles in '-march' option.

2023-12-12 Thread Jiawei
Supports RISC-V profiles[1] in the -march option.

The profile must come at the beginning of the -march string, before any
other formal extensions.

V2: Fixes some format errors and adds code comments for the parse function.
Thanks to Jeff Law for the review and comments.

[1]https://github.com/riscv/riscv-profiles/blob/main/profiles.adoc
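
For example (hypothetical invocations; the new tests below exercise
both forms):

  riscv64-unknown-elf-gcc -march=RVA22U64 -mabi=lp64d test.c
  riscv64-unknown-elf-gcc -march=RVA22U64+v -mabi=lp64d test.c

The second form uses '+' to append further extensions after the
profile name.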

gcc/ChangeLog:

* common/config/riscv/riscv-common.cc (struct riscv_profiles):
  New struct.
(riscv_subset_list::parse_profiles): New function.
(riscv_subset_list::parse): New table.
* config/riscv/riscv-subset.h: New protype.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/arch-31.c: New test.
* gcc.target/riscv/arch-32.c: New test.
* gcc.target/riscv/arch-33.c: New test.
* gcc.target/riscv/arch-34.c: New test.

---
 gcc/common/config/riscv/riscv-common.cc  | 83 +++-
 gcc/config/riscv/riscv-subset.h  |  2 +
 gcc/testsuite/gcc.target/riscv/arch-31.c |  5 ++
 gcc/testsuite/gcc.target/riscv/arch-32.c |  5 ++
 gcc/testsuite/gcc.target/riscv/arch-33.c |  5 ++
 gcc/testsuite/gcc.target/riscv/arch-34.c |  7 ++
 6 files changed, 106 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/arch-31.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/arch-32.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/arch-33.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/arch-34.c

diff --git a/gcc/common/config/riscv/riscv-common.cc 
b/gcc/common/config/riscv/riscv-common.cc
index 4d5a2f874a2..8b674a4a280 100644
--- a/gcc/common/config/riscv/riscv-common.cc
+++ b/gcc/common/config/riscv/riscv-common.cc
@@ -195,6 +195,12 @@ struct riscv_ext_version
   int minor_version;
 };
 
+struct riscv_profiles
+{
+  const char *profile_name;
+  const char *profile_string;
+};
+
 /* All standard extensions defined in all supported ISA spec.  */
 static const struct riscv_ext_version riscv_ext_version_table[] =
 {
@@ -379,6 +385,42 @@ static const struct riscv_ext_version riscv_combine_info[] 
=
   {NULL, ISA_SPEC_CLASS_NONE, 0, 0}
 };
 
+/* This table records the mapping from RISC-V Profiles to march strings.  */
+static const riscv_profiles riscv_profiles_table[] =
+{
+  /* RVI20U only contains the base extension 'i' as a mandatory extension.  */
+  {"RVI20U64", "rv64i"},
+  {"RVI20U32", "rv32i"},
+
+  /* RVA20U contains the 'i,m,a,f,d,c,zicsr' as mandatory extensions.
+ Currently we don't have zicntr,ziccif,ziccrse,ziccamoa,
+ zicclsm,za128rs yet.   */
+  {"RVA20U64", "rv64imafdc_zicsr"},
+
+  /* RVA20S64 includes all the mandatory extensions in RVA20U64 plus
+ 'zifencei' as an additional mandatory extension.
+ Note that ss1p11, svbare, sv39, svade, sscptr, ssvecd, sstvala should
+ be controlled by binutils.  */
+  {"RVA20S64", "rv64imafdc_zicsr_zifencei"},
+
+  /* RVA22U contains the 'i,m,a,f,d,c,zicsr,zihintpause,zba,zbb,zbs,
+ zicbom,zicbop,zicboz,zfhmin,zkt' as mandatory extensions.
+ Currently we don't have zicntr,zihpm,ziccif,ziccrse,ziccamoa,
+ zicclsm,zic64b,za64rs yet.  */
+  {"RVA22U64", "rv64imafdc_zicsr_zihintpause_zba_zbb_zbs"  
\
+   "_zicbom_zicbop_zicboz_zfhmin_zkt"},
+
+  /* RVA22S64 mandatory include all the extensions in RVA22U64 and
+ additonal 'zifencei,svpbmt,svinval' as mandatory extensions.
+ Notes that ss1p12, svbare, sv39, svade, sscptr, ssvecd, sstvala,
+ scounterenw extentions should control by binutils.  */
+  {"RVA22S64","rv64imafdc_zicsr_zifencei_zihintpause"  
\
+   "_zba_zbb_zbs_zicbom_zicbop_zicboz_zfhmin_zkt_svpbmt_svinval"},
+
+  /* Terminate the list.  */
+  {NULL, NULL}
+};
+
 static const riscv_cpu_info riscv_cpu_tables[] =
 {
 #define RISCV_CORE(CORE_NAME, ARCH, TUNE) \
@@ -958,6 +1000,42 @@ riscv_subset_list::parsing_subset_version (const char 
*ext,
   return p;
 }
 
+/* Parse a RISC-V Profile in the -march string.
+   Return a string containing the mandatory extensions of the Profile.  */
+const char *
+riscv_subset_list::parse_profiles (const char * p){
+  /* Check whether the input string contains a Profile.
+ There are two ways to use a Profile in the -march option:
+
+   1. Use only a Profile as the -march input
+   2. Mix a Profile with other extensions
+
+ Use '+' to separate the Profile from the other extensions.  */
+  for (int i = 0; riscv_profiles_table[i].profile_name != NULL; ++i) {
+const char* match = strstr(p, riscv_profiles_table[i].profile_name);
+const char* plus_ext = strchr(p, '+');
+/* Find profile at the begin.  */
+if (match != NULL && match == p) {
+  /* If there's no '+' sign, return the profile_string directly.  */
+  if(!plus_ext)
+   return riscv_profiles_table[i].profile_string;
+  /* If there's a '+' sign, need to add profiles with other ext.  */
+  else {
+   size_t arch_len = strlen(riscv_profiles_table[i].profile_string)+
+ strlen(plus_ext);
+   /* Reset the input string with Profiles mandatory extensions,
+  end with '_' 

Re: [PATCH V3 3/4] OpenMP: Use enumerators for names of trait-sets and traits

2023-12-12 Thread Tobias Burnus

Hi Sandra,

On 07.12.23 16:52, Sandra Loosemore wrote:

This patch introduces enumerators to represent trait-set names and
trait names, which makes it easier to use tables to control other
behavior and for switch statements to dispatch on the tags.  The tags
are stored in the same place in the TREE_LIST structure (OMP_TSS_ID or
OMP_TS_ID) and are encoded there as integer constants.


Thanks - that looks like a huge improvement.

* * *

I think it is useful to prepare for 'target_device'. However, it is currently 
not yet implemented
on mainline - contrary to OG13.

Can you add some kind of error diagnostic for it? On mainline, the current 
result is:

error: expected ‘construct’, ‘device’, ‘implementation’ or ‘user’ before 
‘target_device’
   13 | #pragma omp declare variant (f05) match (target_device={kind(gpu)})
  |  ^

But with your patch, it is silently accepted, which is bad.

(That's a modified version of 
gcc/testsuite/c-c++-common/gomp/declare-variant-10.c:13)

I think you have two options:

* Either fail with the same error message as above

* Or update the error message to list 'target_device' (for C/C++/Fortran)
  and handle 'target_device' separately with a sorry.
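
A minimal standalone mock-up of the second option (the real change would
live in the C/C++/Fortran parsers and use GCC's sorry () diagnostic; the
names here are made up):

  #include <stdio.h>
  #include <string.h>

  static void
  parse_trait_set_name (const char *name)
  {
    if (!strcmp (name, "construct") || !strcmp (name, "device")
        || !strcmp (name, "implementation") || !strcmp (name, "user"))
      ; /* Parse the trait-selector list as today.  */
    else if (!strcmp (name, "target_device"))
      printf ("sorry, unimplemented: '%s' trait-set selector\n", name);
    else
      printf ("error: unknown trait-set selector '%s'\n", name);
  }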

Do whatever you think makes more sense for now, knowing that we do want
to add 'target_device' in the not too distant future.

(I slightly prefer the updated-error-message + sorry variant as it
avoids touching the messages again later, but either is fine.)

* * *

Otherwise, the patch LGTM.

As written before, 1/4, 2/4 and 4/4 are LGTM as posted.

Thanks,

Tobias


gcc/ChangeLog
  * omp-selectors.h: New file.
  * omp-general.h: Include omp-selectors.h.
  (OMP_TSS_CODE, OMP_TSS_NAME): New.
  (OMP_TS_CODE, OMP_TS_NAME): New.
  (make_trait_set_selector, make_trait_selector): Adjust declarations.
  (omp_construct_traits_to_codes): Likewise.
  (omp_context_selector_set_compare): Likewise.
  (omp_get_context_selector): Likewise.
  (omp_get_context_selector_list): New.
  * omp-general.cc (omp_construct_traits_to_codes): Pass length in
  as argument instead of returning it.  Make it table-driven.
  (omp_tss_map): New.
  (kind_properties, vendor_properties, extension_properties): New.
  (atomic_default_mem_order_properties): New.
  (omp_ts_map): New.
  (omp_check_context_selector): Simplify lookup and dispatch logic.
  (omp_mark_declare_variant): Ignore variants with unknown construct
  selectors.  Adjust for new representation.
  (make_trait_set_selector, make_trait_selector): Adjust for new
  representations.
  (omp_context_selector_matches): Simplify dispatch logic.  Avoid
  fixed-sized buffers and adjust call to omp_construct_traits_to_codes.
  (omp_context_selector_props_compare): Adjust for new representations
  and simplify dispatch logic.
  (omp_context_selector_set_compare): Likewise.
  (omp_context_selector_compare): Likewise.
  (omp_get_context_selector): Adjust for new representations, and split
  out...
  (omp_get_context_selector_list): New function.
  (omp_lookup_tss_code): New.
  (omp_lookup_ts_code): New.
  (omp_context_compute_score): Adjust for new representations.  Avoid
  fixed-sized buffers and magic numbers.  Adjust call to
  omp_construct_traits_to_codes.
  * gimplify.cc (omp_construct_selector_matches): Avoid use of
  fixed-size buffer.  Adjust call to omp_construct_traits_to_codes.

gcc/c/ChangeLog
  * c-parser.cc (omp_construct_selectors): Delete.
  (omp_device_selectors): Delete.
  (omp_implementation_selectors): Delete.
  (omp_user_selectors): Delete.
  (c_parser_omp_context_selector): Adjust for new representations
  and simplify dispatch logic.  Uniformly warn instead of sometimes
  error when an unknown selector is found.
  (c_parser_omp_context_selector_specification): Likewise.
  (c_finish_omp_declare_variant): Adjust for new representations.

gcc/cp/ChangeLog
  * decl.cc (omp_declare_variant_finalize_one): Adjust for new
  representations.
  * parser.cc (omp_construct_selectors): Delete.
  (omp_device_selectors): Delete.
  (omp_implementation_selectors): Delete.
  (omp_user_selectors): Delete.
  (cp_parser_omp_context_selector): Adjust for new representations
  and simplify dispatch logic.  Uniformly warn instead of sometimes
  error when an unknown selector is found.
  (cp_parser_omp_context_selector_specification): Likewise.
  * pt.cc (tsubst_attribute): Adjust for new representations.

gcc/fortran/ChangeLog
  * gfortran.h: Include omp-selectors.h.
  (enum gfc_omp_trait_property_kind): Delete, and replace all
  references with equivalent omp_tp_type enumerators.
  (struct gfc_omp_trait_property): Update for omp_tp_type.
  (struct gfc_omp_selector): Replace string name with new enumerator.
  (struct 

Re: [PATCH] Adjust vectorized cost for reduction.

2023-12-12 Thread Richard Biener
On Tue, Dec 12, 2023 at 7:12 AM liuhongt  wrote:
>
> x86 doesn't support horizontal reduction instructions, reduc_op_scal_m
> is emulated with vec_extract_half + op(half vector length)
> Take that into account when calculating cost for vectorization.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> No big performance impact on SPEC2017 as measured on ICX.
> Ok for trunk?

I don't think keying only on vec_to_scalar is good since
vect_model_reduction_cost will always use that when
extracting the scalar result element from the final vector
as well so you'll get double-counting here.

There is currently no good way of identifying the cases
the vectorizer chose reduc_*_scal, this operation
is identified as vector_stmt.

There is STMT_VINFO_REDUC_FN though, but I'm
not 100% positive the stmt_info you get passed has
this set (it's probably on the info_for_reduction node).

It should be possible to invent a new accessor like
vect_reduc_type () computing REDUC_FN though.
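
For context, the emulated sequence being costed looks roughly like this
at the intrinsics level (my sketch, not the vectorizer's actual output):

  #include <immintrin.h>

  /* reduc_plus_scal_v8si, emulated: extract the high half, add at
     half width, repeat -- log2 (nunits) extract+op steps overall.  */
  static int
  reduc_plus_v8si (__m256i v)
  {
    __m128i lo = _mm256_castsi256_si128 (v);
    __m128i hi = _mm256_extracti128_si256 (v, 1); /* vec_extract_hi */
    __m128i s = _mm_add_epi32 (lo, hi);
    s = _mm_add_epi32 (s, _mm_srli_si128 (s, 8));
    s = _mm_add_epi32 (s, _mm_srli_si128 (s, 4));
    return _mm_cvtsi128_si32 (s);
  }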

Richard.

> gcc/ChangeLog:
>
> PR target/112325
> * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
> Handle reduction vec_to_scalar.
> (ix86_vector_costs::ix86_vect_reduc_cost): New function.
> ---
>  gcc/config/i386/i386.cc | 45 +
>  1 file changed, 45 insertions(+)
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 4b6bad37c8f..02c9a5004a1 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -24603,6 +24603,7 @@ private:
>
>/* Estimate register pressure of the vectorized code.  */
>void ix86_vect_estimate_reg_pressure ();
> +  unsigned ix86_vect_reduc_cost (stmt_vec_info, tree);
>/* Number of GENERAL_REGS/SSE_REGS used in the vectorizer, it's used for
>   estimation of register pressure.
>   ??? Currently it's only used by vec_construct/scalar_to_vec
> @@ -24845,6 +24846,12 @@ ix86_vector_costs::add_stmt_cost (int count, 
> vect_cost_for_stmt kind,
> if (TREE_CODE (op) == SSA_NAME)
>   TREE_VISITED (op) = 0;
>  }
> +  /* This is a reduc_*_scal_m, x86 support reduc_*_scal_m with emulation.  */
> +  else if (kind == vec_to_scalar
> +  && stmt_info
> +  && vect_is_reduction (stmt_info))
> +stmt_cost = ix86_vect_reduc_cost (stmt_info, vectype);
> +
>if (stmt_cost == -1)
>  stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
>
> @@ -24875,6 +24882,44 @@ ix86_vector_costs::add_stmt_cost (int count, 
> vect_cost_for_stmt kind,
>return retval;
>  }
>
> +/* x86 doesn't support horizontal reduction instructions,
> +   reduc_op_scal_m is emulated with vec_extract_hi + op.  */
> +unsigned
> +ix86_vector_costs::ix86_vect_reduc_cost (stmt_vec_info stmt_info,
> +tree vectype)
> +{
> +  gcc_assert (vectype);
> +  unsigned cost = 0;
> +  machine_mode mode = TYPE_MODE (vectype);
> +  unsigned len = GET_MODE_SIZE (mode);
> +
> +  /* PSADBW is used for reduc_plus_scal_{v16qi, v8qi, v4qi}.  */
> +  if (GET_MODE_INNER (mode) == E_QImode
> +  && stmt_info
> +  && stmt_info->stmt && gimple_code (stmt_info->stmt) == GIMPLE_ASSIGN
> +  && gimple_assign_rhs_code (stmt_info->stmt) == PLUS_EXPR)
> +{
> +  cost = ix86_cost->sse_op;
> +  /* vec_extract_hi + vpaddb for 256/512-bit reduc_plus_scal_v*qi.  */
> +  if (len > 16)
> +   cost += exact_log2 (len >> 4) * ix86_cost->sse_op * 2;
> +}
> +  else
> +/* vec_extract_hi + op.  */
> +cost = ix86_cost->sse_op * exact_log2 (TYPE_VECTOR_SUBPARTS (vectype)) * 
> 2;
> +
> +  /* Count extra uops for TARGET_*_SPLIT_REGS.  NB: There's no target which
> + supports 512-bit vectors but has TARGET_AVX256/128_SPLIT_REGS.
> + ix86_vect_cost is not used since the reduction instruction sequence
> + consists of mixed vector-length instructions after vec_extract_hi.  */
> +  if ((len == 64 && TARGET_AVX512_SPLIT_REGS)
> +  || (len == 32 && TARGET_AVX256_SPLIT_REGS)
> +  || (len == 16 && TARGET_AVX256_SPLIT_REGS))
> +cost += ix86_cost->sse_op;
> +
> +  return cost;
> +}
> +
>  void
>  ix86_vector_costs::ix86_vect_estimate_reg_pressure ()
>  {
> --
> 2.31.1
>


Re: [PATCH v2] LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT.

2023-12-12 Thread Jiahao Xu



On 2023/12/12 7:26 PM, Xi Ruoyao wrote:

On Tue, 2023-12-12 at 19:14 +0800, Jiahao Xu wrote:

Define LOGICAL_OP_NON_SHORT_CIRCUIT as 0, so that for a short-circuit branch
we use the short-circuit operation instead of the non-short-circuit operation.

This gives a 1.8% improvement in SPECCPU 2017 fprate on 3A6000.

In r14-15 we removed the LOGICAL_OP_NON_SHORT_CIRCUIT definition because the
default value (1 for all current LoongArch CPUs with branch_cost = 6)
may reduce the number of conditional branch instructions.

I guess here the problem is floating-point compare instruction is much
more costly than other instructions but the fact is not correctly
modeled yet.  Could you try
https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640012.html
where I've raised fp_add cost (which is used for estimating floating-
point compare cost) to 5 instructions and see if it solves your problem
without LOGICAL_OP_NON_SHORT_CIRCUIT?
I think this is not the same issue as the cost of floating-point 
comparison instructions. The definition of LOGICAL_OP_NON_SHORT_CIRCUIT 
affects how the short-circuit branch, such as (A AND-IF B), is executed, 
and it is not directly related to the cost of floating-point comparison 
instructions. I will try to test it using SPECCPU 2017.

If not I guess you can try increasing the floating-point comparison cost
more in loongarch_rtx_costs:

 case UNLT:
   /* Branch comparisons have VOIDmode, so use the first operand's
  mode instead.  */
   mode = GET_MODE (XEXP (x, 0));
   if (FLOAT_MODE_P (mode))
 {
   *total = loongarch_cost->fp_add;


Try to make it fp_add + something?

   return false;
 }
   *total = loongarch_binary_cost (x, COSTS_N_INSNS (1), COSTS_N_INSNS (4),
   speed);
   return true;


If adjusting the cost model does not work I'd say this is a middle-end
issue and we should submit a bug report.
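
Something like this, say (the exact extra constant is a guess and would
need tuning):

    if (FLOAT_MODE_P (mode))
      {
        *total = loongarch_cost->fp_add + COSTS_N_INSNS (4);
        return false;
      }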


gcc/ChangeLog:

* config/loongarch/loongarch.h (LOGICAL_OP_NON_SHORT_CIRCUIT): Define.

gcc/testsuite/ChangeLog:

* gcc.target/loongarch/short-circuit.c: New test.

diff --git a/gcc/config/loongarch/loongarch.h b/gcc/config/loongarch/loongarch.h
index f1350b6048f..880c576c35b 100644
--- a/gcc/config/loongarch/loongarch.h
+++ b/gcc/config/loongarch/loongarch.h
@@ -869,6 +869,7 @@ typedef struct {
     1 is the default; other values are interpreted relative to that.  */
  
  #define BRANCH_COST(speed_p, predictable_p) loongarch_branch_cost

+#define LOGICAL_OP_NON_SHORT_CIRCUIT 0
  
  /* Return the asm template for a conditional branch instruction.

     OPCODE is the opcode's mnemonic and OPERANDS is the asm template for
diff --git a/gcc/testsuite/gcc.target/loongarch/short-circuit.c 
b/gcc/testsuite/gcc.target/loongarch/short-circuit.c
new file mode 100644
index 000..bed585ee172
--- /dev/null
+++ b/gcc/testsuite/gcc.target/loongarch/short-circuit.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ffast-math -fdump-tree-gimple" } */
+
+int
+short_circuit (float *a)
+{
+  float t1x = a[0];
+  float t2x = a[1];
+  float t1y = a[2];
+  float t2y = a[3];
+  float t1z = a[4];
+  float t2z = a[5];
+
+  if (t1x > t2y  || t2x < t1y  || t1x > t2z || t2x < t1z || t1y > t2z || t2y < 
t1z)
+    return 0;
+
+  return 1;
+}
+/* { dg-final { scan-tree-dump-times "if" 6 "gimple" } } */




Re: [PATCH #1/2] strub: handle volatile promoted args in internal strub [PR112938]

2023-12-12 Thread Richard Biener
On Tue, Dec 12, 2023 at 3:03 AM Alexandre Oliva  wrote:
>
>
> When generating code for an internal strub wrapper, don't clear the
> DECL_NOT_GIMPLE_REG_P flag of volatile args, and gimplify them both
> before and after any conversion.
>
> While at that, move variable TMP into narrower scopes so that it's
> more trivial to track where ARG lives.
>
> Regstrapped on x86_64-linux-gnu.  Ok to install?
>
> (there's a #2/2 followup coming up that addresses the ??? comment added
> herein)
>
>
> for  gcc/ChangeLog
>
> PR middle-end/112938
> * ipa-strub.cc (pass_ipa_strub::execute): Handle promoted
> volatile args in internal strub.  Simplify.
>
> for  gcc/testsuite/ChangeLog
>
> PR middle-end/112938
> * gcc.dg/strub-internal-volatile.c: New.
> ---
>  gcc/ipa-strub.cc   |   29 
> +---
>  gcc/testsuite/gcc.dg/strub-internal-volatile.c |   10 
>  2 files changed, 31 insertions(+), 8 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/strub-internal-volatile.c
>
> diff --git a/gcc/ipa-strub.cc b/gcc/ipa-strub.cc
> index 8ec6824e8a802..45294b0b46bcb 100644
> --- a/gcc/ipa-strub.cc
> +++ b/gcc/ipa-strub.cc
> @@ -3203,7 +3203,6 @@ pass_ipa_strub::execute (function *)
>i++, arg = DECL_CHAIN (arg), nparm = DECL_CHAIN (nparm))
> {
>   tree save_arg = arg;
> - tree tmp = arg;
>
>   /* Arrange to pass indirectly the parms, if we decided to do
>  so, and revert its type in the wrapper.  */
> @@ -3211,10 +3210,9 @@ pass_ipa_strub::execute (function *)
> {
>   tree ref_type = TREE_TYPE (nparm);
>   TREE_ADDRESSABLE (arg) = true;
> - tree addr = build1 (ADDR_EXPR, ref_type, arg);
> - tmp = arg = addr;
> + arg = build1 (ADDR_EXPR, ref_type, arg);
> }
> - else
> + else if (!TREE_THIS_VOLATILE (arg))
> DECL_NOT_GIMPLE_REG_P (arg) = 0;

I wonder why you clear this at all?  The next update_address_taken
will take care of this if possible.

>
>   /* Convert the argument back to the type used by the calling
> @@ -3223,16 +3221,31 @@ pass_ipa_strub::execute (function *)
>  double to be passed on unchanged to the wrapped
>  function.  */
>   if (TREE_TYPE (nparm) != DECL_ARG_TYPE (nparm))
> -   arg = fold_convert (DECL_ARG_TYPE (nparm), arg);
> +   {
> + tree tmp = arg;
> + /* If ARG is e.g. volatile, we must copy and
> +convert in separate statements.  ???  Should
> +we drop volatile from the wrapper
> +instead?  */

volatile on function parameters are indeed odd beasts.  You could
also force volatile arguments to be passed indirectly.  I think for
GIMPLE thunks we do it as you now do here.
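
To illustrate at the source level what the wrapper has to do now
(my sketch, not the generated GIMPLE):

  void wrapped (int);

  void
  wrapper (volatile short arg)
  {
    short tmp = arg;      /* single volatile read, own statement */
    wrapped ((int) tmp);  /* promotion done on the copy */
  }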

> + if (!is_gimple_val (arg))
> +   {
> + tmp = create_tmp_reg (TYPE_MAIN_VARIANT
> +   (TREE_TYPE (arg)), "arg");
> + gimple *stmt = gimple_build_assign (tmp, arg);
> + gsi_insert_after (, stmt, GSI_NEW_STMT);
> +   }
> + arg = fold_convert (DECL_ARG_TYPE (nparm), tmp);
> +   }
>
>   if (!is_gimple_val (arg))
> {
> - tmp = create_tmp_reg (TYPE_MAIN_VARIANT
> -   (TREE_TYPE (arg)), "arg");
> + tree tmp = create_tmp_reg (TYPE_MAIN_VARIANT
> +(TREE_TYPE (arg)), "arg");
>   gimple *stmt = gimple_build_assign (tmp, arg);
>   gsi_insert_after (, stmt, GSI_NEW_STMT);
> + arg = tmp;
> }
> - vargs.quick_push (tmp);
> + vargs.quick_push (arg);
>   arg = save_arg;
> }
> /* These strub arguments are adjusted later.  */
> diff --git a/gcc/testsuite/gcc.dg/strub-internal-volatile.c 
> b/gcc/testsuite/gcc.dg/strub-internal-volatile.c
> new file mode 100644
> index 0..cdfca67616bc8
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/strub-internal-volatile.c
> @@ -0,0 +1,10 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target strub } */
> +
> +void __attribute__ ((strub("internal")))
> +f(volatile short) {
> +}
> +
> +void g(void) {
> +  f(0);
> +}
>
> --
> Alexandre Oliva, happy hackerhttps://FSFLA.org/blogs/lxo/
>Free Software Activist   GNU Toolchain Engineer
> More tolerance and less 

[PATCH v8 2/2] Add gcov MC/DC tests for GDC

2023-12-12 Thread Jørgen Kvalsvik
This is a mostly straight port of the gcov-19.c tests from the C test
suite. The only notable differences from C to D are that D flips the
true/false outcomes for loop headers, and the D front end ties loop and
ternary conditions to a slightly different locus.

The test for the >64 conditions warning is disabled, as it either needs
support from the testing framework or something similar to #pragma GCC
diagnostic push to avoid a test failure from the detected warning.

gcc/testsuite/ChangeLog:

* gdc.dg/gcov.exp: New test.
* gdc.dg/gcov-1.d: New test.
---
 gcc/testsuite/gdc.dg/gcov-1.d | 1712 +
 gcc/testsuite/gdc.dg/gcov.exp |   44 +
 2 files changed, 1756 insertions(+)
 create mode 100644 gcc/testsuite/gdc.dg/gcov-1.d
 create mode 100644 gcc/testsuite/gdc.dg/gcov.exp

diff --git a/gcc/testsuite/gdc.dg/gcov-1.d b/gcc/testsuite/gdc.dg/gcov-1.d
new file mode 100644
index 000..0bc8e30b438
--- /dev/null
+++ b/gcc/testsuite/gdc.dg/gcov-1.d
@@ -0,0 +1,1712 @@
+/* { dg-options "-fcondition-coverage -ftest-coverage" } */
+/* { dg-do run { target native } } */
+
+/* Some side effect to stop branches from being pruned.  */
+int x = 0;
+
+int id  (int x) { return  x; }
+int inv (int x) { return !x; }
+
+/* || works.  */
+void
+mcdc001a (int a, int b)
+{
+if (a || b) /* conditions(1/4) true(0) false(0 1) */
+   /* conditions(end) */
+   x = 1;
+else
+   x = 2;
+}
+
+void
+mcdc001b (int a, int b)
+{
+if (a || b) /* conditions(3/4) true(0) */
+   /* conditions(end) */
+   x = 1;
+else
+   x = 2;
+}
+
+void
+mcdc001c (int a, int b)
+{
+if (a || b) /* conditions(4/4) */
+   x = 1;
+else
+   x = 2;
+}
+
+void
+mcdc001d (int a, int b, int c)
+{
+if (a || b || c) /* conditions(2/6) false(0 1 2) true(2) */
+/* conditions(end) */
+   x = 1;
+}
+
+/* && works */
+void
+mcdc002a (int a, int b)
+{
+if (a && b) /* conditions(1/4) true(0 1) false(0) */
+   /* conditions(end) */
+   x = 1;
+else
+   x = 2;
+}
+
+void
+mcdc002b (int a, int b)
+{
+if (a && b) /* conditions(3/4) false(0) */
+   /* conditions(end) */
+   x = 1;
+else
+   x = 2;
+}
+
+void
+mcdc002c (int a, int b)
+{
+if (a && b) /* conditions(4/4) */
+   x = 1;
+else
+   x = 2;
+}
+
+void
+mcdc002d (int a, int b, int c)
+{
+if (a && b && c) /* conditions(4/6) false(0 2) */
+/* conditions(end) */
+   x = 1;
+}
+
+/* Negation works.  */
+void
+mcdc003a (int a, int b)
+{
+if (!a || !b) /* conditions(2/4) false(0 1) */
+ /* conditions(end) */
+   x = 1;
+else
+   x = 2;
+}
+
+/* Single conditionals with and without else.  */
+void
+mcdc004a (int a)
+{
+if (a) /* conditions(1/2) true(0) */
+  /* conditions(end) */
+   x = 1;
+else
+   x = 2;
+}
+
+void
+mcdc004b (int a)
+{
+if (a) /* conditions(2/2) */
+   x = 1;
+else
+   x = 2;
+}
+
+void
+mcdc004c (int a)
+{
+if (a) /* conditions(1/2) false(0) */
+  /* conditions(end) */
+   x = 1;
+}
+
+void
+mcdc004d (int a, int b, int c)
+{
+if (a)  /* conditions(2/2) */
+{
+   if (b || c) /* conditions(1/4) true(1) false(0 1) */
+   x = a + b + c;
+}
+}
+
+void
+mcdc004e (int a, int b, int c)
+{
+if (a)  /* conditions(2/2) */
+{
+   if (b || c) /* conditions(1/4) true(1) false(0 1) */
+   /* conditions(end) */
+   x = a + b + c;
+}
+else
+{
+   x = c;
+}
+}
+
+void
+mcdc004f (int a, int b, int c)
+{
+if (a)  /* conditions(1/2) false(0) */
+   /* conditions(end) */
+{
+   x = 1;
+}
+else if (b) /* conditions(0/2) true(0) false(0) */
+   /* conditions(end) */
+{
+   x = 2;
+   if (c)  /* conditions(0/2) true(0) false(0) */
+   /* conditions(end) */
+   x = 3;
+}
+}
+
+/* Mixing && and || works.  */
+void
+mcdc005a (int a, int b, int c)
+{
+if ((a && b) || c) /* conditions(1/6) true(0 1) false(0 1 2) */
+  /* conditions(end) */
+   x = 1;
+else
+   x = 2;
+}
+
+void
+mcdc005b (int a, int b, int c, int d)
+{
+/* This is where masking MC/DC gets unintuitive:
+
+   1 1 0 0 => covers 1 (d = 0) as && 0 masks everything to the left
+   1 0 0 0 => covers 2 (b = 0, c = 0) as (a && 0) masks a and d is never
+   evaluated. */
+if ((a && (b || c)) && d) /* conditions(3/8) true(0 1 2 3) false(0) */
+ /* conditions(end) */
+   x = 1;
+else
+   x = 2;
+}
+
+void
+mcdc005c (int a, int b, int c, int d)
+{
+if (a || (b && c) || d) /* conditions(2/8) true(0 3) false(0 1 2 3) */
+   /* conditions(end) */
+x = a + b + c + d;
+}
+
+void
+mcdc005d (int a, int b, int c, int d)
+{
+/* This test is quite significant - it has a single 
