RE: Re: [PATCH] RISC-V: Allow VMS{Compare} (V1, V1) shortcut optimization

2023-04-21 Thread Li, Pan2 via Gcc-patches
Hi Kito

Thanks for the suggestion. Sorry for the late response; I was stuck on the
auto generation of the remaining rvv test files.

I had a similar discussion with juzhe about this approach, and took Patch v2's
way due to the concerns below.

1. The vector.md is quite complicated already; maintenance may get out of
control if we add many new define_insn_and_split patterns for the shortcut.
2. The newly added patterns may not be friendly to the underlying
auto-vectorization.

Juzhe can correct me if any of this is misleading.

Pan

-Original Message-
From: Kito Cheng  
Sent: Friday, April 21, 2023 9:02 PM
To: Li, Pan2 
Cc: juzhe.zh...@rivai.ai; gcc-patches ; Kito.cheng 
; Wang, Yanzhang 
Subject: Re: Re: [PATCH] RISC-V: Allow VMS{Compare} (V1, V1) shortcut 
optimization

Hi Pan:

One idea came to my mind: maybe we should add a new define_insn_and_split
pattern instead of changing @pred_mov.

On Fri, Apr 21, 2023 at 7:17 PM Li, Pan2 via Gcc-patches 
 wrote:
>
> Thanks kito, will try to reproduce this issue and keep you posted.
>
> Pan
>
> -Original Message-
> From: Kito Cheng 
> Sent: Friday, April 21, 2023 6:17 PM
> To: Li, Pan2 
> Cc: juzhe.zh...@rivai.ai; gcc-patches ; 
> Kito.cheng ; Wang, Yanzhang 
> 
> Subject: Re: Re: [PATCH] RISC-V: Allow VMS{Compare} (V1, V1) shortcut 
> optimization
>
> I got a bunch of new fails, including ICEs, in the gcc testsuite, and some
> cases are hanging; could you take a look?
>
> $ riscv64-unknown-linux-gnu-gcc
> gcc.target/riscv/rvv/vsetvl/avl_single-92.c -O2 -march=rv32gcv
> -mabi=ilp32
> during RTL pass: expand
> /scratch1/kitoc/riscv-gnu-workspace/riscv-gnu-toolchain-trunk/gcc/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/avl_single-92.c:
> In function 'f':
> /scratch1/kitoc/riscv-gnu-workspace/riscv-gnu-toolchain-trunk/gcc/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/avl_single-92.c:8:13:
> internal compiler error: in maybe_gen_insn, at optabs.cc:8102
> 8 |   vbool64_t mask = *(vbool64_t*) (in + 100);
>   | ^~~~
> 0x130d278 maybe_gen_insn(insn_code, unsigned int, expand_operand*)
> ../../../../riscv-gnu-toolchain-trunk/gcc/gcc/optabs.cc:8102
>
>
> On Fri, Apr 21, 2023 at 5:47 PM Li, Pan2 via Gcc-patches 
>  wrote:
> >
> > Kindly ping for the PATCH v2. Just FYI, there will be some follow-up
> > investigation based on this PATCH, such as VMSEQ.
> >
> > Pan
> >
> > -Original Message-
> > From: Li, Pan2
> > Sent: Wednesday, April 19, 2023 7:27 PM
> > To: 'Kito Cheng' ; 'juzhe.zh...@rivai.ai'
> > 
> > Cc: 'gcc-patches' ; 'Kito.cheng'
> > ; Wang, Yanzhang 
> > Subject: RE: Re: [PATCH] RISC-V: Allow VMS{Compare} (V1, V1) 
> > shortcut optimization
> >
> > Updated the Patch v2 with more detailed information for clarification.
> > Please continue to help review.
> >
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-April/616175.html
> >
> > Pan
> >
> > -Original Message-
> > From: Li, Pan2
> > Sent: Wednesday, April 19, 2023 6:33 PM
> > To: Kito Cheng ; juzhe.zh...@rivai.ai
> > Cc: gcc-patches ; Kito.cheng 
> > ; Wang, Yanzhang 
> > Subject: RE: Re: [PATCH] RISC-V: Allow VMS{Compare} (V1, V1) 
> > shortcut optimization
> >
> > Sure thing.
> >
> > For the ChangeLog, I assumed it was generated automatically before. LOL.
> >
> > Pan
> >
> > -Original Message-
> > From: Kito Cheng 
> > Sent: Wednesday, April 19, 2023 5:46 PM
> > To: juzhe.zh...@rivai.ai
> > Cc: Li, Pan2 ; gcc-patches 
> > ; Kito.cheng ; Wang, 
> > Yanzhang 
> > Subject: Re: Re: [PATCH] RISC-V: Allow VMS{Compare} (V1, V1) 
> > shortcut optimization
> >
> > Hi JuZhe:
> >
> > Thanks for explaining!
> >
> >
> > Hi Pan:
> >
> > I think it would be helpful if JuZhe's explanation could be written
> > into the commit log.
> >
> >
> > > gcc/ChangeLog:
> > >
> > >* config/riscv/riscv-v.cc (emit_pred_op):
> > >* config/riscv/riscv-vector-builtins-bases.cc:
> > >* config/riscv/vector.md:
> >
> > And don't forget to write something in the ChangeLog... :P


[r14-159 Regression] FAIL: std/ranges/iota/max_size_type.cc execution test on Linux/x86_64

2023-04-21 Thread haochen.jiang via Gcc-patches
On Linux/x86_64,

03cebd304955a6b9c5607e09312d77f1307cc98e is the first bad commit
commit 03cebd304955a6b9c5607e09312d77f1307cc98e
Author: Jason Merrill 
Date:   Tue Apr 18 21:32:07 2023 -0400

c++: fix 'unsigned typedef-name' extension [PR108099]

caused

FAIL: std/ranges/iota/max_size_type.cc execution test

with GCC configured with

../../gcc/configure 
--prefix=/export/users/haochenj/src/gcc-bisect/master/master/r14-159/usr 
--enable-clocale=gnu --with-system-zlib --with-demangler-in-ld 
--with-fpmath=sse --enable-languages=c,c++,fortran --enable-cet --without-isl 
--enable-libmpx x86_64-linux --disable-bootstrap

To reproduce:

$ cd {build_dir}/x86_64-linux/libstdc++-v3/testsuite && make check 
RUNTESTFLAGS="conformance.exp=std/ranges/iota/max_size_type.cc 
--target_board='unix{-m32}'"
$ cd {build_dir}/x86_64-linux/libstdc++-v3/testsuite && make check 
RUNTESTFLAGS="conformance.exp=std/ranges/iota/max_size_type.cc 
--target_board='unix{-m32\ -march=cascadelake}'"
$ cd {build_dir}/x86_64-linux/libstdc++-v3/testsuite && make check 
RUNTESTFLAGS="conformance.exp=std/ranges/iota/max_size_type.cc 
--target_board='unix{-m64}'"
$ cd {build_dir}/x86_64-linux/libstdc++-v3/testsuite && make check 
RUNTESTFLAGS="conformance.exp=std/ranges/iota/max_size_type.cc 
--target_board='unix{-m64\ -march=cascadelake}'"

(Please do not reply to this email, for question about this report, contact me 
at haochen dot jiang at intel.com)


Re: [GCC14 QUEUE PATCH] RISC-V: Optimize fault only first load

2023-04-21 Thread Jeff Law via Gcc-patches




On 3/29/23 19:28, juzhe.zh...@rivai.ai wrote:

From: Juzhe-Zhong 

gcc/ChangeLog:

 * config/riscv/riscv-vsetvl.cc (pass_vsetvl::cleanup_insns): Adapt 
PASS.
This doesn't provide any useful information as far as I can tell. 
Perhaps something like:

Erase AVL from instructions with the fault first load property.

OK with a better ChangeLog entry.

Related.  As a separate patch, can you add a function comment to 
cleanup_insns?  It doesn't have one and it should.


Thanks,
jeff


Re: [GCC14 QUEUE PATCH] RISC-V: Eliminate redundant vsetvli for duplicate AVL def

2023-04-21 Thread Jeff Law via Gcc-patches




On 3/27/23 19:01, juzhe.zh...@rivai.ai wrote:

From: Juzhe-Zhong 

void f (int8_t* base1,int8_t* base2,int8_t* out,int n)
{
  vint8mf4_t v = __riscv_vle8_v_i8mf4 (base1, 32);
  for (int i = 0; i < n; i++){
v = __riscv_vor_vx_i8mf4 (v, 101, 32);
v = __riscv_vle8_v_i8mf4_tu (v, base2, 32);
  }
  __riscv_vse8_v_i8mf4 (out, v, 32);
}

before this patch:
f:
li  a5,32
vsetvli zero,a5,e8,mf4,tu,ma
vle8.v  v1,0(a0)
ble a3,zero,.L2
li  t0,0
li  a0,101
.L3:
addiw   t0,t0,1
vor.vx  v1,v1,a0
vle8.v  v1,0(a1)
bne a3,t0,.L3
.L2:
vsetvli zero,zero,e8,mf4,tu,ma
vse8.v  v1,0(a2)
ret


after this patch:

f:
li  a5,32
vsetvli zero,a5,e8,mf4,tu,ma
vle8.v  v1,0(a0)
ble a3,zero,.L2
li  t0,0
li  a0,101
.L3:
addiw   t0,t0,1
vor.vx  v1,v1,a0
vle8.v  v1,0(a1)
bne a3,t0,.L3
.L2:
vse8.v  v1,0(a2)
ret

gcc/ChangeLog:

 * config/riscv/riscv-vsetvl.cc 
(vector_infos_manager::all_avail_in_compatible_p): New function.
 (pass_vsetvl::refine_vsetvls): Remove redundant vsetvli.
 * config/riscv/riscv-vsetvl.h: New function declare.

gcc/testsuite/ChangeLog:

 * gcc.target/riscv/rvv/vsetvl/avl_single-102.c: New test.

---
  gcc/config/riscv/riscv-vsetvl.cc  | 67 ++-
  gcc/config/riscv/riscv-vsetvl.h   |  1 +
  .../riscv/rvv/vsetvl/avl_single-102.c | 16 +
  3 files changed, 81 insertions(+), 3 deletions(-)
  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/vsetvl/avl_single-102.c

diff --git a/gcc/config/riscv/riscv-vsetvl.cc b/gcc/config/riscv/riscv-vsetvl.cc
index 4948e5d4c5e..58568b45010 100644
--- a/gcc/config/riscv/riscv-vsetvl.cc
+++ b/gcc/config/riscv/riscv-vsetvl.cc
@@ -2376,6 +2376,23 @@ vector_infos_manager::all_empty_predecessor_p (const 
basic_block cfg_bb) const
return true;
  }
  
+bool

+vector_infos_manager::all_avail_in_compatible_p (const basic_block cfg_bb) 
const

This needs a function comment.  Perhaps:

/* Return TRUE if the incoming vector configuration state
   to CFG_BB is compatible with the vector configuration
   state in CFG_BB, FALSE otherwise.  */



+
+  /* Optimize such case:
+   void f (int8_t* base1,int8_t* base2,int8_t* out,int n)
+   {
+ vint8mf4_t v = __riscv_vle8_v_i8mf4 (base1, 32);
+ for (int i = 0; i < n; i++){
+   v = __riscv_vor_vx_i8mf4 (v, 101, 32);
+   v = __riscv_vle8_v_i8mf4_tu (v, base2, 32);
+ }
+ __riscv_vse8_v_i8mf4 (out, v, 32);
+   }
In general I would suggest that, rather than writing code like this in the 
comments, you instead describe the properties you're looking for.  That way 
someone who may not be a RISC-V expert can more easily interpret the 
scenario you're looking for and what action you want to take when the 
scenario is discovered.


In this particular case it looks like you're trying to describe the 
scenario where all incoming edges to a block have a vector state that is 
compatible with the block.  In such a case we need not emit a vsetvl in 
the current block.
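A predicate with that meaning might be shaped roughly like this (just a
sketch using GCC's CFG edge iteration; incoming_state_compatible_p is a
hypothetical stand-in for the real availability check, which lives in
riscv-vsetvl.cc and is more involved):

  /* Sketch: require every predecessor edge of CFG_BB to carry a vector
     configuration compatible with CFG_BB's expected incoming state.  */
  bool
  all_preds_compatible_p (const basic_block cfg_bb)
  {
    edge e;
    edge_iterator ei;
    FOR_EACH_EDGE (e, ei, cfg_bb->preds)
      if (!incoming_state_compatible_p (e, cfg_bb)) /* hypothetical helper */
        return false;
    return true;
  }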


The right place for the code is in the testsuite.

So generally OK, though you do need to adjust the comments slightly. 
Please do that and repost for a final review/ACK.


Thanks,

Jeff



Re: [PATCH] RISC-V: Fix PR108270

2023-04-21 Thread Jeff Law via Gcc-patches




On 3/27/23 00:59, juzhe.zh...@rivai.ai wrote:

From: Juzhe-Zhong 

 PR 108270

Fix bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108270.

Consider the following testcase:
void f (void * restrict in, void * restrict out, int l, int n, int m)
{
   for (int i = 0; i < l; i++){
 for (int j = 0; j < m; j++){
   for (int k = 0; k < n; k++)
 {
   vint8mf8_t v = __riscv_vle8_v_i8mf8 (in + i + j, 17);
   __riscv_vse8_v_i8mf8 (out + i + j, v, 17);
 }
 }
   }
}

Compile option: -O3

Before this patch:
mv  a7,a2
mv  a6,a0   
 mv t1,a1
mv  a2,a3
vsetivli zero,17,e8,mf8,ta,ma
...

After this patch:
 mv  a7,a2
 mv  a6,a0
 mv  t1,a1
 mv  a2,a3
 ble a7,zero,.L1
 ble a4,zero,.L1
 ble a3,zero,.L1
 add a1,a0,a4
 li  a0,0
 vsetivli zero,17,e8,mf8,ta,ma
...

It will produce a potential bug when:

int main ()
{
   vsetivli zero, 100,.
   f (in, out, 0,0,0)
   asm volatile ("csrr a0,vl":::"memory");

   // Before this patch the a0 is 17. (Wrong).
   // After this patch the a0 is 100. (Correct).
   ...
}

gcc/ChangeLog:

 * config/riscv/riscv-vsetvl.cc 
(vector_infos_manager::all_empty_predecessor_p): New function.
 (pass_vsetvl::backward_demand_fusion): Fix bug.
 * config/riscv/riscv-vsetvl.h: New function declare.

gcc/testsuite/ChangeLog:

 * gcc.target/riscv/rvv/vsetvl/imm_bb_prop-1.c: Adapt test.
 * gcc.target/riscv/rvv/vsetvl/imm_conflict-3.c: Adapt test.
 * gcc.target/riscv/rvv/vsetvl/pr108270.c: New test.

---
  gcc/config/riscv/riscv-vsetvl.cc  | 24 +++
  gcc/config/riscv/riscv-vsetvl.h   |  2 ++
  .../riscv/rvv/vsetvl/imm_bb_prop-1.c  |  2 +-
  .../riscv/rvv/vsetvl/imm_conflict-3.c |  4 ++--
  .../gcc.target/riscv/rvv/vsetvl/pr108270.c| 19 +++
  5 files changed, 48 insertions(+), 3 deletions(-)
  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr108270.c

diff --git a/gcc/config/riscv/riscv-vsetvl.cc b/gcc/config/riscv/riscv-vsetvl.cc
index b5f5301ea43..4948e5d4c5e 100644
--- a/gcc/config/riscv/riscv-vsetvl.cc
+++ b/gcc/config/riscv/riscv-vsetvl.cc
@@ -2361,6 +2361,21 @@ vector_infos_manager::all_same_ratio_p (sbitmap bitdata) 
const
return true;
  }
  
+bool

+vector_infos_manager::all_empty_predecessor_p (const basic_block cfg_bb) const

Needs a function comment.  Perhaps something like:

/* Return TRUE if CFG_BB's predecessors have no vector configuration
   state.  FALSE otherwise.  */

Which I think argues that the name isn't good.  Perhaps 
"no_vector_state_in_preds" would be a better name?




  
+  /* Fix PR108270:

+
+   bb 0 -> bb 1
+We don't need to backward fuse VL/VTYPE info from bb 1 to bb 0
+if bb 1 is not inside a loop and all predecessors of bb 0 are empty. */
+  if (m_vector_manager->all_empty_predecessor_p (cfg_bb))
+   continue;
Rather than "empty" I would say something about vector configuration 
state.  "empty" is much more likely to be interpreted as having no 
instructions or something similar, which isn't the property you're checking.




So I think making the minor comment/name changes and this will be fine. 
Please repost it though for a final ACK.


jeff


Re: [PATCH] vect: Verify that GET_MODE_NUNITS is greater than one.

2023-04-21 Thread Jeff Law via Gcc-patches




On 3/14/23 15:52, Michael Collison wrote:

While working on autovectorizing for the RISCV port I encountered an issue
where can_duplicate_and_interleave_p assumes that GET_MODE_NUNITS is
evenly divisible by two. The RISC-V target has vector modes (e.g. VNx1DImode)
where GET_MODE_NUNITS is equal to one.
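The guard the ChangeLog describes might look like this (a sketch only,
assuming it sits near the top of can_duplicate_and_interleave_p and that
vector_mode names the mode under test; GET_MODE_NUNITS returns a poly_uint64,
so the poly-int comparison helpers apply):

  /* Sketch: splitting into two interleaved halves is meaningless when
     the mode holds at most one element.  */
  if (!known_gt (GET_MODE_NUNITS (vector_mode), 1U))
    return false;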

Tested on RISCV and x86_64-linux-gnu. Okay?

2023-03-09  Michael Collison  

* tree-vect-slp.cc (can_duplicate_and_interleave_p):
Check that GET_MODE_NUNITS is greater than one.
Is this still relevant?   I know other changes were made to deal with 
the case where GET_MODE_NUNITS returns 1, but I don't know if they made 
this obsolete.


Any chance we could get a testcase for this?  I realize it might depend 
on unmerged RVV bits.



jeff


Re: [PATCH] RISC-V: Optimize zbb ins sext.b and sext.h in rv64

2023-04-21 Thread Jeff Law via Gcc-patches




On 3/23/23 19:53, Feng Wang wrote:

This patch optimizes the combine processing for sext.b/h in rv64.
Please refer to the following test case,

[ ... ]
I've opened BZ109592 to track this problem.

jeff


Re: [PATCH] RISC-V: Optimize zbb ins sext.b and sext.h in rv64

2023-04-21 Thread Jeff Law via Gcc-patches




On 3/23/23 19:53, Feng Wang wrote:

This patch optimizes the combine processing for sext.b/h in rv64.
Please refer to the following test case,
int sextb32(int x)
{ return (x << 24) >> 24; }

The rtl expression is as follows,
(insn 6 3 7 2 (set (reg:SI 138)
 (ashift:SI (subreg/s/u:SI (reg/v:DI 136 [ xD.2271 ]) 0)
 (const_int 24 [0x18]))) "sextb.c":2:13 195 {ashlsi3}
  (expr_list:REG_DEAD (reg/v:DI 136 [ xD.2271 ])
 (nil)))
(insn 7 6 8 2 (set (reg:SI 137)
 (ashiftrt:SI (reg:SI 138)
 (const_int 24 [0x18]))) "sextb.c":2:20 196 {ashrsi3}
  (expr_list:REG_DEAD (reg:SI 138)
 (nil)))

During the combine phase, they will combine into
(set (reg:SI 137)
 (ashiftrt:SI (subreg:SI (ashift:DI (reg:DI 140)
 (const_int 24 [0x18])) 0)
 (const_int 24 [0x18])))

The optimal combine result is
(set (reg:SI 137)
 (sign_extend:SI (subreg:QI (reg:DI 140) 0)))
This can be converted to the sext ins.

Due to the influence of the subreg, the current processing
can't obtain the immediate of the left shift.  We need to peel off
another layer of rtl to obtain it.

gcc/ChangeLog:

 * combine.cc (extract_left_shift): Add SUBREG case.

gcc/testsuite/ChangeLog:

 * gcc.target/riscv/zbb-sext-rv64.c: New test.
SUBREGs have painful semantics and we should be very careful about just 
stripping them.


For example, you might have a subreg that extracts the *high* part.  Or 
you might have (subreg (mem)) or a paradoxical subreg, etc.


At the *least* this case would need verification that you're getting the 
lowpart.  However, I suspect there are other conditions that need to be 
checked to make this valid.
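At a minimum, that verification might look like the following (a sketch of
only the lowpart test, using the existing subreg_lowpart_p helper; the other
conditions would still need to be worked out):

  /* Sketch: only strip the SUBREG when it selects the low part of its
     inner value; (subreg (mem)), paradoxical subregs, etc. would still
     need separate handling, as noted above.  */
  if (GET_CODE (x) == SUBREG && subreg_lowpart_p (x))
    x = SUBREG_REG (x);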


But I would suggest we look elsewhere.  It could be that combine is 
reassociating the subreg in ways that are undesirable and which 
ultimately makes our job harder. Additionally if we can fix this in a 
generic simplification/folder routine, then multiple passes can benefit.


For example in simplify_context::simplify_binary_operation we get a form 
more amenable to optimization.


#0  simplify_context::simplify_binary_operation (this=0x7fffda68, code=ASHIFTRT, mode=E_SImode, 
op0=0x7fffea11eb40, op1=0x7fffea009610) at /home/jlaw/riscv-persist/ventana/gcc/gcc/simplify-rtx.cc:2558

2558  gcc_assert (GET_RTX_CLASS (code) != RTX_COMPARE);
(gdb) p code
$24 = ASHIFTRT
(gdb) p mode
$25 = E_SImode
(gdb) p debug_rtx (op0)
(ashift:SI (subreg/s/u:SI (reg/v:DI 74 [ x ]) 0)
(const_int 24 [0x18]))
$26 = void
(gdb) p debug_rtx (op1)
(const_int 24 [0x18])
$27 = void


So that's (ashiftrt (ashift (object) 24) 24), ie sign extension.

ie, we really don't have to think about the fact that the underlying 
object is a SUBREG because the outer operations are very clearly a sign 
extension regardless of the object they're operating on.


With that in mind I would suggest you look at adding a case for detecting 
zero/sign extension in simplify_context::simplify_binary_operation_1.
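Such a case might be shaped roughly as follows (a hypothetical, untested
sketch of where the match could go in the ASHIFTRT handling there):

  case ASHIFTRT:
    /* Sketch: (ashiftrt:M (ashift:M X C) C) is a sign extension of the
       low PRECISION(M) - C bits of X, no matter whether X is a REG or a
       SUBREG.  */
    if (GET_CODE (op0) == ASHIFT
        && CONST_INT_P (op1)
        && rtx_equal_p (XEXP (op0, 1), op1))
      {
        /* ... form (sign_extend:M (lowpart X)) here when a narrow
           integer mode of that many bits exists ...  */
      }
    break;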


Thanks,
Jeff


Re: [PATCH v1] [RFC] Improve folding for comparisons with zero in tree-ssa-forwprop.

2023-04-21 Thread Philipp Tomsich
Any guidance on the next steps for this patch?
I believe that we answered all open questions, but may have missed something.

With trunk open for new development, we would like to revise and land this…

Thanks,
Philipp.

On Mon, 20 Mar 2023 at 15:02, Manolis Tsamis  wrote:
>
> On Fri, Mar 17, 2023 at 10:31 AM Richard Biener
>  wrote:
> >
> > On Thu, Mar 16, 2023 at 4:27 PM Manolis Tsamis  
> > wrote:
> > >
> > > For this C testcase:
> > >
> > > void g();
> > > void f(unsigned int *a)
> > > {
> > >   if (++*a == 1)
> > > g();
> > > }
> > >
> > > GCC will currently emit a comparison with 1 by using the value
> > > of *a after the increment. This can be improved by comparing
> > > against 0 and using the value before the increment. As a result
> > > there is a potentially shorter dependency chain (no need to wait
> > > for the result of +1) and on targets with compare zero instructions
> > > the generated code is one instruction shorter.
> >
> > The downside is we now need two registers and their lifetime overlaps.
> >
> > Your patch mixes changing / inverting a parameter (which seems unneeded
> > for the actual change) with preferring compares against zero.
> >
>
> Indeed. I thought that without that change the original names wouldn't 
> properly
> describe what the parameter actually does and that's why I've changed it.
> I can undo that in the next revision.
>
> > What's the reason to specifically prefer compares against zero?  On x86
> > we have add that sets flags, so ++*a == 0 would be preferred, but
> > for your sequence we'd need a test reg, reg; branch on zero, so we do
> > not save any instruction.
> >
>
> My reasoning is that zero is treated preferentially  in most if not
> all architectures. Some specifically have zero/non-zero comparisons so
> we get one less instruction. X86 doesn't explicitly have that but I
> think that test reg, reg may not always be needed depending on the
> rest of the code. By what Andrew mentions below there may even be
> optimizations for zero in the microarchitecture level.
>
> Because this is still an arch-specific thing I initially tried to make
> it arch-dependent by invoking the target's cost functions (e.g., if I
> recall correctly aarch64 will return a lower cost for zero
> comparisons). But the code turned out complicated and messy so I came
> up with this alternative that just treats zero preferentially.
>
> If you have in mind a way that this can be done in a better way I
> could try to implement it.
>
> > We do have quite some number of bugreports with regards to making VRPs
> > life harder when splitting things this way.  It's easier for VRP to handle
> >
> >   _1 = _2 + 1;
> >   if (_1 == 1)
> >
> > than it is
> >
> >   _1 = _2 + 1;
> >   if (_2 == 0)
> >
> > where VRP fails to derive a range for _1 on the _2 == 0 branch.  So besides
> > the life-range issue there's other side-effects as well.  Maybe ranger 
> > meanwhile
> > can handle the above case?
> >
>
> Answered by Andrew MacLeod.
>
> > What's the overall effect of the change on a larger code base?
> >
>
> I made some quick runs of SPEC2017 and got the following results (# of
> folds of zero comparisons):
>
>  gcc        2586
>  xalancbmk  1456
>  perlbench   375
>  x264        307
>  omnetpp     137
>  leela        24
>  deepsjeng    15
>  exchange2     4
>  xz            4
>
> My test runs on Aarch64 do not show any significant change in runtime.
> In some cases (e.g. gcc) the binary is smaller in size, but that can
> depend on a number of other things.
>
> Thanks,
> Manolis
>
> > Thanks,
> > Richard.
> >
> > >
> > > Example from Aarch64:
> > >
> > > Before
> > > ldr w1, [x0]
> > > add w1, w1, 1
> > > str w1, [x0]
> > > cmp w1, 1
> > > beq .L4
> > > ret
> > >
> > > After
> > > ldr w1, [x0]
> > > add w2, w1, 1
> > > str w2, [x0]
> > > cbz w1, .L4
> > > ret
> > >
> > > gcc/ChangeLog:
> > >
> > > * tree-ssa-forwprop.cc (combine_cond_expr_cond):
> > > (forward_propagate_into_comparison_1): Optimize
> > > for zero comparisons.
> > >
> > > Signed-off-by: Manolis Tsamis 
> > > ---
> > >
> > >  gcc/tree-ssa-forwprop.cc | 41 +++-
> > >  1 file changed, 28 insertions(+), 13 deletions(-)
> > >
> > > diff --git a/gcc/tree-ssa-forwprop.cc b/gcc/tree-ssa-forwprop.cc
> > > index e34f0888954..93d5043821b 100644
> > > --- a/gcc/tree-ssa-forwprop.cc
> > > +++ b/gcc/tree-ssa-forwprop.cc
> > > @@ -373,12 +373,13 @@ rhs_to_tree (tree type, gimple *stmt)
> > >  /* Combine OP0 CODE OP1 in the context of a COND_EXPR.  Returns
> > > the folded result in a form suitable for COND_EXPR_COND or
> > > NULL_TREE, if there is no suitable simplified form.  If
> > > -   INVARIANT_ONLY is true only gimple_min_invariant results are
> > > -   considered simplified.  */
> > > +   ALWAYS_COMBINE is false then only combine it the 

Re: [PATCH] Implement range-op entry for sin/cos.

2023-04-21 Thread Jakub Jelinek via Gcc-patches
On Fri, Apr 21, 2023 at 10:43:44PM +0200, Mikael Morin wrote:
> Hello,
> 
> > --- gcc/gimple-range-op.cc.jj   2023-04-21 17:09:48.250367999 +0200
> > +++ gcc/gimple-range-op.cc  2023-04-21 18:37:26.048325391 +0200
> > @@ -439,20 +436,38 @@ public:
> > r.set_varying (type);
> > return true;
> > }
> > +
> >   // Results outside of [-1.0, +1.0] are impossible.
> >   REAL_VALUE_TYPE lb = lhs.lower_bound ();
> >   REAL_VALUE_TYPE ub = lhs.upper_bound ();
> > -if (real_less (&lb, &dconstm1)
> > -   || real_less (&dconst1, &ub))
> > +if (real_less (&lb, &dconstm1) || real_less (&dconst1, &ub))
> > {
> 
> Shouldn't lb and ub be swapped in this condition?

Yes, they should.

> If I understand correctly, we are looking for ranges like [whatever,x] where
> x < -1.0 or [y, whatever] where 1.0 < y.

Jakub



Re: [PATCH] Implement range-op entry for sin/cos.

2023-04-21 Thread Mikael Morin

Hello,


--- gcc/gimple-range-op.cc.jj   2023-04-21 17:09:48.250367999 +0200
+++ gcc/gimple-range-op.cc  2023-04-21 18:37:26.048325391 +0200
@@ -439,20 +436,38 @@ public:
r.set_varying (type);
return true;
}
+
  // Results outside of [-1.0, +1.0] are impossible.
  REAL_VALUE_TYPE lb = lhs.lower_bound ();
  REAL_VALUE_TYPE ub = lhs.upper_bound ();
-if (real_less (&lb, &dconstm1)
-   || real_less (&dconst1, &ub))
+if (real_less (&lb, &dconstm1) || real_less (&dconst1, &ub))
{


Shouldn't lb and ub be swapped in this condition?
If I understand correctly, we are looking for ranges like [whatever,x] 
where x < -1.0 or [y, whatever] where 1.0 < y.
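In other words, the impossible-result check would presumably become
(a sketch of the swapped form only):

  // The whole result range lies below -1.0 or above +1.0.
  if (real_less (&ub, &dconstm1) || real_less (&dconst1, &lb))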


Mikael


Re: [PATCH] RISC-V: Fix redundant vmv1r.v instruction in vmsge.vx codegen

2023-04-21 Thread Jeff Law via Gcc-patches




On 3/22/23 06:15, juzhe.zh...@rivai.ai wrote:

From: Ju-Zhe Zhong 

Current expansion of vmsge will make RA produce redundant vmv1r.v.

testcase:
void f1 (void * in, void *out, int32_t x)
{
 vbool32_t mask = *(vbool32_t*)in;
 asm volatile ("":::"memory");
 vint32m1_t v = __riscv_vle32_v_i32m1 (in, 4);
 vint32m1_t v2 = __riscv_vle32_v_i32m1_m (mask, in, 4);
 vbool32_t m3 = __riscv_vmsge_vx_i32m1_b32 (v, x, 4);
 vbool32_t m4 = __riscv_vmsge_vx_i32m1_b32_mu (mask, m3, v, x, 4);
 m4 = __riscv_vmsge_vv_i32m1_b32_m (m4, v2, v2, 4);
 __riscv_vsm_v_b32 (out, m4, 4);
}

Before this patch:
f1:
 vsetvli a5,zero,e8,mf4,ta,ma
 vlm.v   v0,0(a0)
 vsetivli zero,4,e32,m1,ta,mu
 vle32.v v3,0(a0)
 vle32.v v2,0(a0),v0.t
 vmslt.vx v1,v3,a2
 vmnot.m v1,v1
 vmslt.vx v1,v3,a2,v0.t
 vmxor.mm v1,v1,v0
 vmv1r.v v0,v1
 vmsge.vv v2,v2,v2,v0.t
 vsm.v   v2,0(a1)
 ret

After this patch:
f1:
 vsetvli a5,zero,e8,mf4,ta,ma
 vlm.v   v0,0(a0)
 vsetivli zero,4,e32,m1,ta,mu
 vle32.v v3,0(a0)
 vle32.v v2,0(a0),v0.t
 vmslt.vx v1,v3,a2
 vmnot.m v1,v1
 vmslt.vx v1,v3,a2,v0.t
 vmxor.mm v0,v1,v0
 vmsge.vv v2,v2,v2,v0.t
 vsm.v   v2,0(a1)
 ret


gcc/ChangeLog:

 * config/riscv/vector.md: Fix redundant vmv1r.v.

gcc/testsuite/ChangeLog:

 * gcc.target/riscv/rvv/base/binop_vx_constraint-150.c: Adapt assembly 
check.

OK.  Please push this to the trunk.

jeff


Re: [PATCH] RISC-V: Fine tune vmadc/vmsbc RA constraint

2023-04-21 Thread Jeff Law via Gcc-patches




On 3/16/23 03:39, juzhe.zh...@rivai.ai wrote:

From: Ju-Zhe Zhong 

gcc/ChangeLog:

 * config/riscv/vector.md: Fix bug of vmsbc

OK.  Please install on the trunk.

jeff


Re: [PATCH] riscv: thead: Add sign/zero extension support for th.ext and th.extu

2023-04-21 Thread Jeff Law via Gcc-patches




On 3/15/23 06:24, Christoph Muellner wrote:

From: Christoph Müllner 

The current support of the bitfield-extraction instructions
th.ext and th.extu (XTheadBb extension) only covers sign_extract
and zero_extract. This patch adds support for sign_extend and
zero_extend to avoid any shifts for sign or zero extensions.

gcc/ChangeLog:

* config/riscv/riscv.md:
* config/riscv/thead.md (*extend2_th_ext):
(*zero_extendsidi2_th_extu):
(*zero_extendhi2_th_extu):

gcc/testsuite/ChangeLog:

* gcc.target/riscv/xtheadbb-ext-1.c: New test.
* gcc.target/riscv/xtheadbb-extu-1.c: New test.

OK.  Though the main part of the ChangeLog needs some content ;-)

jeff


Re: [PATCH] RISC-V: Fine tune gather load RA constraint

2023-04-21 Thread Jeff Law via Gcc-patches




On 3/13/23 02:28, juzhe.zh...@rivai.ai wrote:

From: Ju-Zhe Zhong 

For DEST EEW < SOURCE EEW, the destination can partially overlap a source
register according to the RVV ISA.

gcc/ChangeLog:

 * config/riscv/vector.md: Fix RA constraint.

gcc/testsuite/ChangeLog:

 * gcc.target/riscv/rvv/base/narrow_constraint-12.c: New test.

This is OK.

The one question I keep having when I read these patterns is why we have 
the earlyclobber.


Earlyclobber means that the output is potentially written before the 
inputs are consumed.   Typically for a single instruction pattern such 
constraints wouldn't make a lot of sense as *usually* the inputs are 
consumed before the output is written.
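For reference, the constraint syntax in question looks like this (a generic
illustration, not one of the patterns under review; '&' is the earlyclobber
marker):

  (define_insn "..."
    [(set (match_operand:SI 0 "register_operand" "=&r")
          (plus:SI (match_operand:SI 1 "register_operand" "r")
                   (match_operand:SI 2 "register_operand" "r")))]
    ""
    "...")

The '&' forces the register allocator to choose an output register distinct
from every input, which is only necessary when the output may be written
before all inputs have been read.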


Just looking for a clarification as to why the earlyclobbers are needed 
at all, particularly for non-reduction patterns.


jeff


[COMMITTED] PR tree-optimization/109546 - Do not fold ADDR_EXPR conditions leading to builtin_unreachable early.

2023-04-21 Thread Andrew MacLeod via Gcc-patches
We can't represent ADDR_EXPR in ranges, so when we are processing 
builtin_unreachable() we should not be removing comparisons that utilize 
ADDR_EXPR during the early phases, or we lose some important information.


It was just an oversight that we treated it as a comparison to a 
representable constant.


Bootstrapped on x86_64-pc-linux-gnu with no regressions.  pushed.

This would also be suitable for the next GCC13 release when the branch 
is open.


Andrew
commit 0afefd11e25a05dd4f8a8624e8fb046d9c85686a
Author: Andrew MacLeod 
Date:   Fri Apr 21 15:03:43 2023 -0400

Do not fold ADDR_EXPR conditions leading to builtin_unreachable early.

Ranges can not represent &var globally yet, so we cannot fold these
expressions early or we lose the __builtin_unreachable information.

PR tree-optimization/109546
gcc/
* tree-vrp.cc (remove_unreachable::remove_and_update_globals): Do
not fold conditions with ADDR_EXPR early.

gcc/testsuite/
* gcc.dg/pr109546.c: New.

diff --git a/gcc/testsuite/gcc.dg/pr109546.c b/gcc/testsuite/gcc.dg/pr109546.c
new file mode 100644
index 000..ba8af0f31fa
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr109546.c
@@ -0,0 +1,24 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-tree-optimized" } */
+
+void foo(void);
+static int a, c;
+static int *b = &a;
+static int **d = &b;
+void assert_fail() __attribute__((__noreturn__));
+int main() {
+  int *e = *d;
+  if (e == &a || e == &c);
+  else {
+__builtin_unreachable();
+  assert_fail();
+  }
+  if (e == &a || e == &c);
+  else
+foo();
+}
+
+/* { dg-final { scan-tree-dump-not "assert_fail" "optimized" } } */
+/* { dg-final { scan-tree-dump-not "foo" "optimized" } } */
+
+
diff --git a/gcc/tree-vrp.cc b/gcc/tree-vrp.cc
index f4d484526c7..9b870640e23 100644
--- a/gcc/tree-vrp.cc
+++ b/gcc/tree-vrp.cc
@@ -150,7 +150,9 @@ remove_unreachable::remove_and_update_globals (bool final_p)
   // If this is already a constant condition, don't look either
   if (!lhs_p && !rhs_p)
 	continue;
-
+  // Do not remove addresses early. ie if (x == &y)
+  if (!final_p && lhs_p && TREE_CODE (gimple_cond_rhs (s)) == ADDR_EXPR)
+	continue;
   bool dominate_exit_p = true;
   FOR_EACH_GORI_EXPORT_NAME (m_ranger.gori (), e->src, name)
 	{


Re: [PATCH] RISC-V: Refine reduction RA constraint according to RVV ISA

2023-04-21 Thread Jeff Law via Gcc-patches




On 3/13/23 03:05, juzhe.zh...@rivai.ai wrote:

From: Ju-Zhe Zhong 

According to RVV ISA:
14. Vector Reduction Operations

"The destination vector register can overlap the source operands, including the mask 
register."

gcc/ChangeLog:

 * config/riscv/vector.md: Refine RA constraint.

OK.  Go ahead and install this on the trunk.

jeff


Re: [PATCH] RISC-V: Fine tune vmadc/vmsbc RA constraint

2023-04-21 Thread Jeff Law via Gcc-patches




On 3/13/23 18:38, juzhe.zh...@rivai.ai wrote:

From: Ju-Zhe Zhong 

gcc/ChangeLog:

 * config/riscv/vector.md: Fine tune vmadc/vmsbc RA constraint.

gcc/testsuite/ChangeLog:

 * gcc.target/riscv/rvv/base/narrow_constraint-13.c: New test.
 * gcc.target/riscv/rvv/base/narrow_constraint-14.c: New test.
 * gcc.target/riscv/rvv/base/narrow_constraint-15.c: New test.
 * gcc.target/riscv/rvv/base/narrow_constraint-16.c: New test.

This is OK.  Go ahead and install this on the trunk.
jeff


New Swedish PO file for 'gcc' (version 13.1-b20230409)

2023-04-21 Thread Translation Project Robot
Hello, gentle maintainer.

This is a message from the Translation Project robot.

A revised PO file for textual domain 'gcc' has been submitted
by the Swedish team of translators.  The file is available at:

https://translationproject.org/latest/gcc/sv.po

(This file, 'gcc-13.1-b20230409.sv.po', has just now been sent to you in
a separate email.)

All other PO files for your package are available in:

https://translationproject.org/latest/gcc/

Please consider including all of these in your next release, whether
official or a pretest.

Whenever you have a new distribution with a new version number ready,
containing a newer POT file, please send the URL of that distribution
tarball to the address below.  The tarball may be just a pretest or a
snapshot, it does not even have to compile.  It is just used by the
translators when they need some extra translation context.

The following HTML page has been updated:

https://translationproject.org/domain/gcc.html

If any question arises, please contact the translation coordinator.

Thank you for all your work,

The Translation Project robot, in the
name of your translation coordinator.




Re: [RFC PATCH v1 10/10] RISC-V: Support XVentanaCondOps extension

2023-04-21 Thread Jeff Law via Gcc-patches




On 2/10/23 15:41, Philipp Tomsich wrote:

The vendor-defined XVentanaCondOps extension adds two instructions
with semantics identical to Zicond.

This plugs the 2 new instructions in using the canonical RTX, which
also matches the combiner-input for noce_try_store_flag_mask and
noce_try_store_flag, defined for conditional-zero.

For documentation on XVentanaCondOps, refer to:
   
https://github.com/ventanamicro/ventana-custom-extensions/releases/download/v1.0.1/ventana-custom-extensions-v1.0.1.pdf

gcc/ChangeLog:

* config/riscv/riscv.cc (riscv_rtx_costs): Recognize idiom
for conditional zero as a single instruction for TARGET_XVENTANACONDOPS.
* config/riscv/riscv.md: Include xventanacondops.md.
* config/riscv/zicond.md: Enable splitters for TARGET_XVENTANACONDOPS.
* config/riscv/xventanacondops.md: New file.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/xventanacondops-and-01.c: New test.
* gcc.target/riscv/xventanacondops-and-02.c: New test.
* gcc.target/riscv/xventanacondops-eq-01.c: New test.
* gcc.target/riscv/xventanacondops-eq-02.c: New test.
* gcc.target/riscv/xventanacondops-ifconv-imm.c: New test.
* gcc.target/riscv/xventanacondops-le-01.c: New test.
* gcc.target/riscv/xventanacondops-le-02.c: New test.
* gcc.target/riscv/xventanacondops-lt-01.c: New test.
* gcc.target/riscv/xventanacondops-lt-03.c: New test.
* gcc.target/riscv/xventanacondops-ne-01.c: New test.
* gcc.target/riscv/xventanacondops-ne-03.c: New test.
* gcc.target/riscv/xventanacondops-ne-04.c: New test.
* gcc.target/riscv/xventanacondops-xor-01.c: New test.

OK with the change to use if-then-else.

jeff


Re: [RFC PATCH v1 09/10] RISC-V: Recognize xventanacondops extension

2023-04-21 Thread Jeff Law via Gcc-patches




On 2/10/23 15:41, Philipp Tomsich wrote:

This adds the xventanacondops extension to the option parsing and as a
default for the ventana-vt1 core:

gcc/ChangeLog:

* common/config/riscv/riscv-common.cc: Recognize
   "xventanacondops" as part of an architecture string.
* config/riscv/riscv-opts.h (MASK_XVENTANACONDOPS): Define.
(TARGET_XVENTANACONDOPS): Define.
* config/riscv/riscv.opt: Add "riscv_xventanacondops".

Signed-off-by: Philipp Tomsich 

OK
jeff


Re: [RFC PATCH v1 07/10] RISC-V: Recognize bexti in negated if-conversion

2023-04-21 Thread Jeff Law via Gcc-patches




On 2/10/23 15:41, Philipp Tomsich wrote:

While the positive case "if ((bits >> SHAMT) & 1)" for SHAMT 0..10 can
trigger conversion into efficient branchless sequences
   - with Zbs (bexti + neg + and)
   - with Zicond (andi + czero.nez)
the inverted/negated case results in
   andi a5,a0,1024
   seqz a5,a5
   neg a5,a5
   and a5,a5,a1
due to how the sequence presents to the combine pass.

This adds an additional splitter to reassociate the polarity reversed
case into bexti + addi, if Zbs is present.

gcc/ChangeLog:

* config/riscv/zicond.md: Add split to reassociate
"andi + seqz + neg" into "bexti + addi".

OK.
jeff


Re: [PATCH] gcc/m2: Drop references to $(P)

2023-04-21 Thread Arsen Arsenović via Gcc-patches

Jakub Jelinek  writes:

> Doesn't fix any regression, so not ok for 13.1 and I wouldn't bother
> for 13.2 either.

Okay, but it can affect --enable-languages=all in a slim edge case.

Why not 13.2?  It seems sufficiently simple.

Thanks, have a lovely night!
-- 
Arsen Arsenović


signature.asc
Description: PGP signature


Re: [RFC PATCH v1 06/10] RISC-V: Recognize sign-extract + and cases for czero.eqz/nez

2023-04-21 Thread Jeff Law via Gcc-patches




On 2/10/23 15:41, Philipp Tomsich wrote:

Users might use explicit arithmetic operations to create a mask and
then AND it, in a sequence like
 cond = (bits >> SHIFT) & 1;
 mask = ~(cond - 1);
 val &= mask;
which will present as a single-bit sign-extract.
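(Presumably combine canonicalizes that into something like
(and:DI (sign_extract:DI (reg) (const_int 1) (const_int SHIFT)) (reg)),
which is the shape the new patterns would have to match; this RTL form is an
educated guess, not taken from the patch itself.)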

Depending on what combination of XVentanaCondOps and Zbs is
available, this will map to the following sequences:
  - bexti + czero, if both Zbs and XVentanaCondOps are present
  - andi + czero,  if only XVentanaCondOps is available and the
  sign-extract is operating on bits 10:0 (bit 11
  can't be reached, as the immediate is
  sign-extended)
  - slli + srli + and, otherwise.

gcc/ChangeLog:

* config/riscv/zicond.md: Recognize SIGN_EXTRACT of a
single-bit followed by AND for Zicond.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/zicond-le-01.c: New test.
Conceptually OK.  In fact using bext to drive if conversions is 
something I think we've got in our queue of things to investigate here. 
So you may have just made Raphael's work easier ;-)


As with the other patches we just need to adjust to using the 
if-then-else form.   You've got a mention of XVentanaCondOps in the 
comments, you might want to change that to zicond.



jeff


[PATCH] testsuite: Add testcase for sparc ICE [PR105573]

2023-04-21 Thread Sam James via Gcc-patches
r11-10018-g33914983cf3734c2f8079963ba49fcc117499ef3 fixed PR105312 and added
a test case for target/arm but the duplicate PR105573 has a test case for
target/sparc that was uncommitted until now.

2023-04-21  Sam James   
PR tree-optimization/105312
PR target/105573
* gcc/testsuite/gcc.target/sparc/pr105573.c: New test.

Signed-off-by: Sam James 
---
 gcc/testsuite/gcc.target/sparc/pr105573.c | 14 ++
 1 file changed, 14 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/sparc/pr105573.c

diff --git a/gcc/testsuite/gcc.target/sparc/pr105573.c 
b/gcc/testsuite/gcc.target/sparc/pr105573.c
new file mode 100644
index 000..9eba2e4beba
--- /dev/null
+++ b/gcc/testsuite/gcc.target/sparc/pr105573.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mvis3" } */
+
+int *UINT_sign_args, UINT_sign_steps;
+int *UINT_sign_ip1;
+__attribute__((optimize("O3"))) void UINT_sign() {
+  char *op1 = (char*) UINT_sign_args;
+  int os1 = UINT_sign_steps, i;
+  for (; i; i++, op1 += os1) {
+unsigned in = *(unsigned *)UINT_sign_ip1;
+int *out = (int*) op1;
+*out = in > 0;
+  }
+}
-- 
2.40.0



[pushed] c++: fix 'unsigned typedef-name' extension [PR108099]

2023-04-21 Thread Jason Merrill via Gcc-patches
Tested x86_64-pc-linux-gnu, applying to trunk.

-- 8< --

In the comments for PR108099 Jakub provided some testcases that demonstrated
that even before the regression noted in the patch we were getting the
semantics of this extension wrong: in the unsigned case we weren't producing
the corresponding standard unsigned type but another distinct one of the
same size, and in the signed case we were just dropping it on the floor and
not actually returning a signed type at all.
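For instance (a hypothetical illustration of the extension, not a testcase
taken from the patch):

  typedef unsigned long long rep_t;
  typedef signed rep_t srep_t;   // GNU extension, pedwarns

  // With the fix, srep_t is the corresponding standard signed type
  // (long long); previously the 'signed' was silently dropped and
  // srep_t stayed unsigned, so this assertion failed.
  static_assert (srep_t (-1) < 0, "");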

The former issue is fixed by using c_common_signed_or_unsigned_type instead
of unsigned_type_for, and the latter issue by adding a (signed_p &&
typedef_decl) case.

This patch introduces a failure on std/ranges/iota/max_size_type.cc due to
the latter issue, since the testcase expects 'signed rep_t' to do something
sensible, and previously we didn't.  Now that we do, it exposes a bug in the
__max_diff_type::operator>>= handling of sign extension: when we evaluate
-1000 >> 2 in __max_diff_type we keep the MSB set, but leave the
second-most-significant bit cleared.
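(For reference, an arithmetic right shift of a negative value must replicate
the sign bit into every vacated high bit; in plain C++,

  static_assert ((-1000 >> 2) == -250, "");

whereas the buggy __max_diff_type operator>>= left the bit just below the
MSB cleared.)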

PR c++/108099

gcc/cp/ChangeLog:

* decl.cc (grokdeclarator): Don't clear typedef_decl after 'unsigned
typedef' pedwarn.  Use c_common_signed_or_unsigned_type.  Also
handle 'signed typedef'.

gcc/testsuite/ChangeLog:

* g++.dg/ext/int128-8.C: Remove xfailed dg-bogus markers.
* g++.dg/ext/unsigned-typedef2.C: New test.
* g++.dg/ext/unsigned-typedef3.C: New test.
---
 gcc/cp/decl.cc   | 18 +++---
 gcc/testsuite/g++.dg/ext/int128-8.C  |  4 ++--
 gcc/testsuite/g++.dg/ext/unsigned-typedef2.C | 25 
 gcc/testsuite/g++.dg/ext/unsigned-typedef3.C | 25 
 4 files changed, 60 insertions(+), 12 deletions(-)
 create mode 100644 gcc/testsuite/g++.dg/ext/unsigned-typedef2.C
 create mode 100644 gcc/testsuite/g++.dg/ext/unsigned-typedef3.C

diff --git a/gcc/cp/decl.cc b/gcc/cp/decl.cc
index ab5cb69b2ae..71d33d2b7a4 100644
--- a/gcc/cp/decl.cc
+++ b/gcc/cp/decl.cc
@@ -12478,18 +12478,14 @@ grokdeclarator (const cp_declarator *declarator,
{
  if (typedef_decl)
{
- pedwarn (loc, OPT_Wpedantic, "%qs specified with %qD",
+ pedwarn (loc, OPT_Wpedantic,
+  "%qs specified with typedef-name %qD",
   key, typedef_decl);
  ok = !flag_pedantic_errors;
+ /* PR108099: __int128_t comes from c_common_nodes_and_builtins,
+and is not built as a typedef.  */
  if (is_typedef_decl (typedef_decl))
-   {
- type = DECL_ORIGINAL_TYPE (typedef_decl);
- typedef_decl = NULL_TREE;
-   }
- else
-   /* PR108099: __int128_t comes from c_common_nodes_and_builtins,
-  and is not built as a typedef.  */
-   type = TREE_TYPE (typedef_decl);
+   type = DECL_ORIGINAL_TYPE (typedef_decl);
}
  else if (declspecs->decltype_p)
   error_at (loc, "%qs specified with %<decltype%>", key);
@@ -12542,7 +12538,7 @@ grokdeclarator (const cp_declarator *declarator,
   else if (type == char_type_node)
type = unsigned_char_type_node;
   else if (typedef_decl)
-   type = unsigned_type_for (type);
+   type = c_common_unsigned_type (type);
   else
type = unsigned_type_node;
 }
@@ -12556,6 +12552,8 @@ grokdeclarator (const cp_declarator *declarator,
 type = long_integer_type_node;
   else if (short_p)
 type = short_integer_type_node;
+  else if (signed_p && typedef_decl)
+type = c_common_signed_type (type);
 
   if (decl_spec_seq_has_spec_p (declspecs, ds_complex))
 {
diff --git a/gcc/testsuite/g++.dg/ext/int128-8.C 
b/gcc/testsuite/g++.dg/ext/int128-8.C
index 7e909d50873..07535a9820e 100644
--- a/gcc/testsuite/g++.dg/ext/int128-8.C
+++ b/gcc/testsuite/g++.dg/ext/int128-8.C
@@ -16,8 +16,8 @@ struct is_same <T, T> : true_type {};
 static_assert (is_same <__int128, s128>::value, "");
 static_assert (is_same <signed __int128, s128>::value, "");
 static_assert (is_same <__int128_t, s128>::value, "");
-static_assert (is_same <unsigned __int128, u128>::value, "");  // { dg-bogus "" "" { xfail *-*-* } }
-static_assert (is_same <__uint128_t, u128>::value, "");  // { 
dg-bogus "" "" { xfail *-*-* } }
+static_assert (is_same <unsigned __int128, u128>::value, "");
+static_assert (is_same <__uint128_t, u128>::value, "");
 static_assert (sizeof (s128) == sizeof (__int128), "");
 static_assert (sizeof (u128) == sizeof (unsigned __int128), "");
 static_assert (s128(-1) < 0, "");
diff --git a/gcc/testsuite/g++.dg/ext/unsigned-typedef2.C 
b/gcc/testsuite/g++.dg/ext/unsigned-typedef2.C
new file mode 100644
index 000..936c0ccb748
--- /dev/null
+++ b/gcc/testsuite/g++.dg/ext/unsigned-typedef2.C
@@ -0,0 +1,25 @@
+// PR c++/108099
+// { dg-do compile { target c++11 } }
+// { dg-options "" }
+
+typedef long long t64;
+template <class T, T v> struct integral_constant {
+  

Re: [RFC PATCH v1 05/10] RISC-V: Support noce_try_store_flag_mask as czero.eqz/czero.nez

2023-04-21 Thread Jeff Law via Gcc-patches




On 2/10/23 15:41, Philipp Tomsich wrote:

When if-conversion in noce_try_store_flag_mask starts the sequence off
with an order-operator, our patterns for czero.eqz/nez will receive
the result of the order-operator as a register argument; consequently,
they can't know that the result will be either 1 or 0.

To convey this information (and make czero.eqz/nez applicable), we
wrap the result of the order-operator in a eq/ne against (const_int 0).
This commit adds the split pattern to handle these cases.

During if-conversion, if noce_try_store_flag_mask succeeds, we may see
 if (cur < next) {
next = 0;
 }
transformed into
27: r82:SI=ltu(r76:DI,r75:DI)
   REG_DEAD r76:DI
28: r81:SI=r82:SI^0x1
   REG_DEAD r82:SI
29: r80:DI=zero_extend(r81:SI)
   REG_DEAD r81:SI

This currently escapes the combiner, as RISC-V does not have a pattern
to apply the 'slt' instruction to 'geu' verbs.  By adding a pattern in
this commit, we match such cases.

gcc/ChangeLog:

* config/riscv/predicates.md (anyge_operator): Define.
(anygt_operator): Same.
(anyle_operator): Same.
(anylt_operator): Same.
* config/riscv/riscv.md: Helpers for ge(u) & le(u).
* config/riscv/zicond.md: Add split to wrap an an
order-operator suitably for generating czero.eqz/nez

gcc/testsuite/ChangeLog:

* gcc.target/riscv/zicond-le-02.c: New test.
* gcc.target/riscv/zicond-lt-03.c: New test.
Conceptually OK.  As has been noted, we need to switch to the 
if-then-else form rather than (and (neg)).  OK with that change.


jeff


Re: [PATCH] gcc/m2: Drop references to $(P)

2023-04-21 Thread Jakub Jelinek via Gcc-patches
On Fri, Apr 21, 2023 at 08:27:22PM +0200, Arsen Arsenović wrote:
> Hi Gaius,
> 
> Gaius Mulley  writes:
> 
> > yes certainly this is fine.  lgtm.  Thanks for spotting and the patch
> 
> Sure.  Will push to master and wait for a RM to weigh in on 13.

Doesn't fix any regression, so not ok for 13.1 and I wouldn't bother
for 13.2 either.

Jakub



Re: [PATCH] Implement range-op entry for sin/cos.

2023-04-21 Thread Jakub Jelinek via Gcc-patches
On Thu, Apr 20, 2023 at 02:59:35PM +0200, Jakub Jelinek via Gcc-patches wrote:
> Thanks for working on this.  Though expectedly here we are running
> again into the discussions we had in November about math properties of the
> functions vs. numeric properties in their implementations, how big maximum
> error shall we expect for the functions (and whether we should hardcode
> it for all implementations, or have some more fine-grained list of expected
> ulp errors for each implementation), whether the implementations at least
> guarantee the basic mathematical properties of the functions even if they
> have some errors (say for sin/cos, if they really never return > 1.0 or <
> -1.0) and the same questions repeated for -frounding-math, what kind of
> extra errors to expect when using non-default rounding and whether say sin
> could ever return nextafter (1.0, 2.0) or even larger value say when
> using non-default rounding mode.
> https://gcc.gnu.org/pipermail/gcc-patches/2022-November/606466.html
> was my attempt to get at least some numbers on some targets, I'm afraid
> for most implementations we aren't going to get any numerical proofs of
> maximum errors and the like.  For sin/cos to check whether the implementation
> really never returns > 1.0 or < -1.0 perhaps instead of using randomized
> testing we could exhaustively check some range around 0, M_PI, 3*M_PI,
> -M_PI, -3*M_PI, and say some much larger multiples of M_PI, say 50 ulps
> in each direction about those points, and similarly for sin around M_PI/2
> etc., in all rounding modes.
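Such a probe could look like this minimal sketch (assuming the C library's
sin and the default rounding mode; checking the other rounding modes would
add fesetround calls around the calls to sin):

  #include <math.h>
  #include <stdio.h>

  /* Sketch: scan 50 ulps on each side of CENTER and report any sin
     result that falls outside [-1.0, +1.0].  */
  static void probe (double center)
  {
    double x = center;
    for (int i = 0; i < 50; i++)
      x = nextafter (x, -INFINITY);
    for (int i = 0; i <= 100; i++)
      {
        double s = sin (x);
        if (s > 1.0 || s < -1.0)
          printf ("sin (%a) = %a is outside [-1, 1]\n", x, s);
        x = nextafter (x, INFINITY);
      }
  }

  int main (void)
  {
    const double centers[]
      = { 0.0, M_PI, 3 * M_PI, -M_PI, -3 * M_PI, 50 * M_PI, M_PI / 2 };
    for (unsigned i = 0; i < sizeof centers / sizeof centers[0]; i++)
      probe (centers[i]);
    return 0;
  }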

Apart from the ulps ranges, for which I plan to introduce a target hook next
week...

> > +if (!lh.maybe_isnan ())
> 
> This condition isn't sufficient, even if lh can't be NAN, but just
> may be +Inf or -Inf, the result needs to include maybe NAN.

I've incorporated my other comments into patch form.
I also found missing undefined_p () checks in two spots, and the
r.set_undefined (); stuff was misplaced (it was done in the
lhs.maybe_isnan () case, which is incorrect: if the lhs may be NAN,
then even if the finite range is say [-30., -10.] or [1.5, 42.],
the result shouldn't be invalid, because the result could still be NAN,
and in that case an operand of say +-Inf or NAN would be valid).
Actually, thinking about it some more, perhaps we should do that check
before the maybe_isnan () check, and if we find the impossible finite
range, either use r.set_undefined () if !maybe_isnan (), or handle it
like known_isnan () otherwise.

Also, I think we should remember whether it is SIN or COS; we'll need
that both for the ulps case and if we improve it for (ub-lb) < 2*PI
ranges.  Now, talking about that, I'd very much like to avoid finding
out whether some multiple of PI/2 occurs inside such ranges, as the
precision of real.cc is clearly not sufficient for that, but perhaps we
could use the derivative of sin (which is cos) or of cos (which is -sin)
to see whether the function at each boundary is increasing or
decreasing, and from that, together with the boundaries' approximate
distance, find out whether the range needs to be extended to +1 or -1.

So, just incremental WIP so far...

--- gcc/gimple-range-op.cc.jj   2023-04-21 17:09:48.250367999 +0200
+++ gcc/gimple-range-op.cc  2023-04-21 18:37:26.048325391 +0200
@@ -405,17 +405,20 @@ class cfn_sincos : public range_operator
 public:
   using range_operator_float::fold_range;
   using range_operator_float::op1_range;
+  cfn_sincos (combined_fn cfn) { m_cfn = cfn; }
   virtual bool fold_range (frange , tree type,
   const frange , const frange &,
   relation_trio) const final override
   {
+if (lh.undefined_p ())
+  return false;
 if (lh.known_isnan () || lh.known_isinf ())
   {
r.set_nan (type);
return true;
   }
 r.set (type, dconstm1, dconst1);
-if (!lh.maybe_isnan ())
+if (!lh.maybe_isnan () && !lh.maybe_isinf ())
   r.clear_nan ();
 return true;
   }
@@ -423,15 +426,9 @@ public:
  const frange , const frange &,
  relation_trio) const final override
   {
-if (!lhs.maybe_isnan ())
-  {
-   // If NAN is not valid result, the input cannot include either
-   // a NAN nor a +-INF.
-   REAL_VALUE_TYPE lb = real_min_representable (type);
-   REAL_VALUE_TYPE ub = real_max_representable (type);
-   r.set (type, lb, ub, nan_state (false, false));
-   return true;
-  }
+if (lhs.undefined_p ())
+  return false;
+
 // A known NAN means the input is [-INF,-INF][+INF,+INF] U +-NAN,
 // which we can't currently represent.
 if (lhs.known_isnan ())
@@ -439,20 +436,38 @@ public:
r.set_varying (type);
return true;
   }
+
 // Results outside of [-1.0, +1.0] are impossible.
 REAL_VALUE_TYPE lb = lhs.lower_bound ();
 REAL_VALUE_TYPE ub = lhs.upper_bound ();
-if (real_less (&lb, &dconstm1)
-   || real_less (&dconst1, &ub))
+if (real_less (&lb, &dconstm1) || real_less (&dconst1, &ub))
   {
-   

Re: [PATCH] tree, c++: declare some basic functions inline

2023-04-21 Thread Jason Merrill via Gcc-patches

On 4/21/23 13:07, Patrick Palka wrote:

On Sun, 4 Dec 2022, Patrick Palka wrote:


The functions strip_array_types, is_typedef_decl, typedef_variant_p,
cp_type_quals and cp_expr_location are used throughout the C++ frontend
including in some fairly hot parts (e.g. in the tsubst routines and
cp_walk_subtree) and they're small enough that the overhead of calling
them out-of-line is relatively significant.

This patch moves their definitions into the appropriate headers to
enable inlining them.  This speeds up the C++ frontend by ~1% according
to my experiments.  In passing this also downgrades the assert in
cp_type_quals to a checking assert.

Bootstrapped and regtested on x86_64-pc-linux-gnu, does this look OK
for trunk stage3 or perhaps for stage1?


Ping.


OK.



gcc/cp/ChangeLog:

* cp-tree.h (cp_type_quals): Define here.  Downgrade assert into
a checking assert.
(cp_expr_location): Define here.
* tree.cc (cp_expr_location): Don't define here.
* typeck.cc (cp_type_quals): Don't define here.

gcc/ChangeLog:

* tree.cc (strip_array_types): Don't define here.
(is_typedef_decl): Don't define here.
(typedef_variant_p): Don't define here.
* tree.h (strip_array_types): Define here.
(is_typedef_decl): Define here.
(typedef_variant_p): Define here.
---
  gcc/cp/cp-tree.h | 50 ++--
  gcc/cp/tree.cc   | 24 ---
  gcc/cp/typeck.cc | 23 --
  gcc/tree.cc  | 29 
  gcc/tree.h   | 32 ---
  5 files changed, 77 insertions(+), 81 deletions(-)

diff --git a/gcc/cp/cp-tree.h b/gcc/cp/cp-tree.h
index addd26ea077..19914d08a03 100644
--- a/gcc/cp/cp-tree.h
+++ b/gcc/cp/cp-tree.h
@@ -49,7 +49,7 @@ c-common.h, not after.
 but not all node kinds do (e.g. constants, and references to
 params, locals, etc), so we stash a copy here.  */
  
-extern location_t cp_expr_location		(const_tree);

+inline location_t cp_expr_location (const_tree);
  
  class cp_expr

  {
@@ -8100,7 +8100,6 @@ extern bool error_type_p  (const_tree);
  extern bool ptr_reasonably_similar(const_tree, const_tree);
  extern tree build_ptrmemfunc  (tree, tree, int, bool,
 tsubst_flags_t);
-extern int cp_type_quals   (const_tree);
  extern int type_memfn_quals   (const_tree);
  extern cp_ref_qualifier type_memfn_rqual  (const_tree);
  extern tree apply_memfn_quals (tree, cp_cv_quals,
@@ -8151,6 +8150,29 @@ extern void maybe_warn_about_useless_cast   
(location_t, tree, tree,
 tsubst_flags_t);
  extern tree cp_perform_integral_promotions  (tree, tsubst_flags_t);
  
+/* Returns the type qualifiers for this type, including the qualifiers on the

+   elements for an array type.  */
+
+inline int
+cp_type_quals (const_tree type)
+{
+  int quals;
+  /* This CONST_CAST is okay because strip_array_types returns its
+ argument unmodified and we assign it to a const_tree.  */
+  type = strip_array_types (CONST_CAST_TREE (type));
+  if (type == error_mark_node
+  /* Quals on a FUNCTION_TYPE are memfn quals.  */
+  || TREE_CODE (type) == FUNCTION_TYPE)
+return TYPE_UNQUALIFIED;
+  quals = TYPE_QUALS (type);
+  /* METHOD and REFERENCE_TYPEs should never have quals.  */
+  gcc_checking_assert ((TREE_CODE (type) != METHOD_TYPE
+   && !TYPE_REF_P (type))
+  || ((quals & (TYPE_QUAL_CONST|TYPE_QUAL_VOLATILE))
+  == TYPE_UNQUALIFIED));
+  return quals;
+}
+
  extern tree finish_left_unary_fold_expr  (tree, int);
  extern tree finish_right_unary_fold_expr (tree, int);
  extern tree finish_binary_fold_expr  (tree, tree, int);
@@ -8168,6 +8190,30 @@ loc_or_input_loc (location_t loc)
return loc == UNKNOWN_LOCATION ? input_location : loc;
  }
  
+/* Like EXPR_LOCATION, but also handle some tcc_exceptional that have

+   locations.  */
+
+inline location_t
+cp_expr_location (const_tree t_)
+{
+  tree t = CONST_CAST_TREE (t_);
+  if (t == NULL_TREE)
+return UNKNOWN_LOCATION;
+  switch (TREE_CODE (t))
+{
+case LAMBDA_EXPR:
+  return LAMBDA_EXPR_LOCATION (t);
+case STATIC_ASSERT:
+  return STATIC_ASSERT_SOURCE_LOCATION (t);
+case TRAIT_EXPR:
+  return TRAIT_EXPR_LOCATION (t);
+case PTRMEM_CST:
+  return PTRMEM_CST_LOCATION (t);
+default:
+  return EXPR_LOCATION (t);
+}
+}
+
  inline location_t
  cp_expr_loc_or_loc (const_tree t, location_t or_loc)
  {
diff --git a/gcc/cp/tree.cc b/gcc/cp/tree.cc
index 1487f4975c5..4066b014f6e 100644
--- a/gcc/cp/tree.cc
+++ b/gcc/cp/tree.cc
@@ -6214,30 +6214,6 @@ cp_tree_code_length (enum tree_code code)
  }
  }
  
-/* Like EXPR_LOCATION, but also handle 

Re: [PATCH 1/2] Use NO_REGS in cost calculation when the preferred register class are not known yet.

2023-04-21 Thread Vladimir Makarov via Gcc-patches



On 4/19/23 20:46, liuhongt via Gcc-patches wrote:

1547  /* If this insn loads a parameter from its stack slot, then it
1548 represents a savings, rather than a cost, if the parameter is
1549 stored in memory.  Record this fact.
1550
1551 Similarly if we're loading other constants from memory (constant
1552 pool, TOC references, small data areas, etc) and this is the only
1553 assignment to the destination pseudo.

At that time, the preferred regclass is unknown, and GENERAL_REGS is used to
record the memory move cost, but that's not accurate, especially for large
vector modes, e.g. a 512-bit vector on x86, which would most probably be
allocated to SSE_REGS instead of GENERAL_REGS. Using GENERAL_REGS here will
overestimate the cost of this load and make the RA propagate the memory
operand into many consuming instructions, which causes worse performance.


For this case GENERAL_REGS was used in GCC practically all the time.  
You can check this in the old regclass.c file (which existed until the 
IRA introduction).


But I guess it is ok to use NO_REGS for this to promote more usage of 
registers instead of the equivalent memory, as a lot of code has changed 
since then (the old versions of GCC did not even support vector regs).


Although it would be nice to do some benchmarking (SPEC is preferable) 
for this kind of change.


On the other hand, I expect that any performance regression (if any) 
will be reported anyway.


The patch is ok for me.  You can commit it into the trunk.

Thank you for addressing this issue.


Fortunately, NO_REGS is used to record the best scenario, so the patch uses
NO_REGS instead of GENERAL_REGS here; it could help the RA in PR108707.

Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}
and aarch64-linux-gnu.
Ok for trunk?

gcc/ChangeLog:

PR rtl-optimization/108707
* ira-costs.cc (scan_one_insn): Use NO_REGS instead of
GENERAL_REGS when preferred reg_class is not known.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr108707.c: New test.




Re: [PATCH v2] configure: Only create serdep.tmp if needed

2023-04-21 Thread Jeff Law via Gcc-patches




On 1/16/23 18:12, Peter Foley wrote:

There's no reason to create this file if none of the serial configure
options are passed.

v2: Use test instead of [ to avoid running afoul of autoconf quoting.

ChangeLog:

* configure: Regenerate.
* configure.ac: Only create serdep.tmp if needed
Thanks.  I bootstrapped and regression tested x86 without issues and 
pushed this patch to the trunk.


jeff


Re: [PATCH] gcc/m2: Drop references to $(P)

2023-04-21 Thread Arsen Arsenović via Gcc-patches
Hi Gaius,

Gaius Mulley  writes:

> yes certainly this is fine.  lgtm.  Thanks for spotting and the patch

Sure.  Will push to master and wait for a RM to weigh in on 13.

Thanks!
-- 
Arsen Arsenović


signature.asc
Description: PGP signature


[committed] [PR testsuite/109549] Adjust x86 testsuite for recent if-conversion cost checking

2023-04-21 Thread Jeff Law
This test expected if-conversion to happen for a sequence which appears 
to always cost more than a branchy sequence.  This was exposed by a 
recent change to the if-converter to add checking in a path where it was 
missing.


So I've just adjusted the test to assume it should never if-convert into 
cmov instructions.


Committed to the trunk.  There's a few of the embedded targets that are 
regressing in similar manners, so I'm not closing the PR yet.


jeff

commit f1f18198b069f461155191ecba41bc87bf5689dd
Author: Jeff Law 
Date:   Fri Apr 21 12:22:24 2023 -0600

Adjust x86 testsuite for recent if-conversion cost checking

gcc/testsuite
PR testsuite/109549
* gcc.target/i386/cmov6.c: No longer expect this test to
generate 'cmov' instructions.

diff --git a/gcc/testsuite/gcc.target/i386/cmov6.c b/gcc/testsuite/gcc.target/i386/cmov6.c
index 535326e4c2a..5111c8a9099 100644
--- a/gcc/testsuite/gcc.target/i386/cmov6.c
+++ b/gcc/testsuite/gcc.target/i386/cmov6.c
@@ -1,6 +1,9 @@
 /* { dg-do compile } */
 /* { dg-options "-O2 -march=k8" } */
-/* { dg-final { scan-assembler "cmov\[^6\]" } } */
+/* if-converting this sequence would require two cmov
+   instructions and seems to always cost more independent
+   of the TUNE_ONE_IF_CONV setting.  */
+/* { dg-final { scan-assembler-not "cmov\[^6\]" } } */
 
 /* Verify that blocks are converted to conditional moves.  */
 extern int bar (int, int);


Re: [PATCH] gcc/m2: Drop references to $(P)

2023-04-21 Thread Gaius Mulley via Gcc-patches
Arsen Arsenović  writes:

> $(P) seems to have been a workaround for some old, proprietary make
> implementations that we no longer support.  It was removed in
> r0-31149-gb8dad04b688e9c.
>
> gcc/m2/ChangeLog:
>
>   * Make-lang.in: Remove references to $(P).
>   * Make-maintainer.in: Ditto.
> ---
> Hi,
>
> We spotted that the m2 makefile includes some long-gone compatibility
> variable $(P), presumably left over from when m2 was not in the tree
> yet.  This induced a build failure on our end:
> https://bugs.gentoo.org/904714
>
> Build-tested on x86_64-pc-linux-gnu.  I haven't finished running the
> testsuite.  I believe this only ever expands to an empty string (if not
> set by the env) in the current build system, so in theory, it should be
> safe.
>
> OK for gcc-13 and trunk (with a priority on the former)?
>
>  gcc/m2/Make-lang.in   | 4 ++--
>  gcc/m2/Make-maintainer.in | 2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/m2/Make-lang.in b/gcc/m2/Make-lang.in
> index b34db0d9156..d0fa692e5b8 100644
> --- a/gcc/m2/Make-lang.in
> +++ b/gcc/m2/Make-lang.in
> @@ -514,7 +514,7 @@ GM2_LIBS_BOOT = m2/gm2-compiler-boot/gm2.a \
>  cc1gm2$(exeext): m2/stage1/cc1gm2$(exeext) $(m2.prev)
>   cp -p $< $@
>  
> -m2/stage2/cc1gm2$(exeext): m2/stage1/cc1gm2$(exeext) m2/gm2-compiler/m2flex.o $(P) \
> +m2/stage2/cc1gm2$(exeext): m2/stage1/cc1gm2$(exeext) m2/gm2-compiler/m2flex.o \
>  $(GM2_C_OBJS) $(BACKEND) $(LIBDEPS) $(GM2_LIBS) \
>  m2/gm2-gcc/rtegraph.o plugin/m2rte$(soext)
>   -test -d $(@D) || $(mkinstalldirs) $(@D)
> @@ -527,7 +527,7 @@ m2/stage2/cc1gm2$(exeext): m2/stage1/cc1gm2$(exeext) m2/gm2-compiler/m2flex.o $(
>   @$(call LINK_PROGRESS,$(INDEX.m2),end)
>  
>  m2/stage1/cc1gm2$(exeext): gm2$(exeext) m2/gm2-compiler-boot/m2flex.o \
> -$(P) $(GM2_C_OBJS) $(BACKEND) $(LIBDEPS) \
> +$(GM2_C_OBJS) $(BACKEND) $(LIBDEPS) \
>  $(GM2_LIBS_BOOT) $(MC_LIBS) \
>  m2/gm2-gcc/rtegraph.o plugin/m2rte$(soext) \
>  $(m2.prev)
> diff --git a/gcc/m2/Make-maintainer.in b/gcc/m2/Make-maintainer.in
> index 17bd9a2d37e..a70682673cd 100644
> --- a/gcc/m2/Make-maintainer.in
> +++ b/gcc/m2/Make-maintainer.in
> @@ -753,7 +753,7 @@ GM2_LIBS_PARANOID = m2/gm2-compiler-paranoid/gm2.a \
>  gm2.paranoid: m2/m2obj3/cc1gm2$(exeext) gm2.verifyparanoid
>  
> m2/m2obj3/cc1gm2$(exeext): m2/m2obj2/cc1gm2$(exeext) m2/gm2-compiler-paranoid/m2flex.o \
> -$(P) $(GM2_C_OBJS) $(BACKEND) $(LIBDEPS) $(GM2_LIBS_PARANOID) \
> +$(GM2_C_OBJS) $(BACKEND) $(LIBDEPS) $(GM2_LIBS_PARANOID) \
>  m2/gm2-gcc/rtegraph.o plugin/m2rte$(exeext).so m2/gm2-libs-boot/M2LINK.o
>   -test -d m2/m2obj3 || $(mkinstalldirs) m2/m2obj3
>   @$(call LINK_PROGRESS,$(INDEX.m2),start)

Hi,

yes certainly this is fine.  lgtm.  Thanks for spotting and the patch

regards,
Gaius


[PATCH][committed] aarch64: Emit single-instruction for smin (x, 0) and smax (x, 0)

2023-04-21 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

Motivated by https://reviews.llvm.org/D148249, we can expand the SMIN (x, 0)
and SMAX (x, 0) cases to a single instruction using a combined AND/BIC and
ASR operation.
Given that we already have well-fitting TARGET_CSSC patterns and expanders
for the min/max codes in the backend, this patch does some minor refactoring
to ensure that for TARGET_CSSC we emit the right SMAX/SMIN RTL codes or fall
back to the generic expanders, and that for !TARGET_CSSC we emit a simple
SMIN/SMAX with a zero RTX, which is now matched by a separate pattern.
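
For reference, these are the scalar identities being exploited (a sketch for
32-bit int; it relies on arithmetic right shift of signed values, which GCC
guarantees):

int smax_zero (int x) { return x & ~(x >> 31); }  /* smax (x, 0): BIC with the sign mask.  */
int smin_zero (int x) { return x &  (x >> 31); }  /* smin (x, 0): AND with the sign mask.  */

Here x >> 31 is all-ones for negative x and zero otherwise, so a single
AND/BIC combined with the ASR covers each case.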

Bootstrapped and tested on aarch64-none-linux-gnu.

Pushing to trunk.
Thanks,
Kyrill

gcc/ChangeLog:

* config/aarch64/aarch64.md (aarch64_umax<mode>3_insn): Delete.
(umax<mode>3): Emit raw UMAX RTL instead of going through gen_ function
for umax.
(<optab><mode>3): New define_expand for MAXMIN_NOUMAX codes.
(*aarch64_<optab><mode>3_zero): Define.
(*aarch64_<optab><mode>3_cssc): Likewise.
* config/aarch64/iterators.md (maxminand): New code attribute.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/sminmax-asr_1.c: New test.


minmax.patch
Description: minmax.patch


[PATCH][committed] PR target/108779 aarch64: Implement -mtp= option

2023-04-21 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

A user has requested that we support the -mtp= option in aarch64 GCC for
changing which TPIDR register is read for TLS accesses.  I'm not a big fan
of the option name, but we already support it in the arm port and Clang
already supports it for AArch64 [1], where it accepts the 'el0', 'el1',
'el2', 'el3' values.

This patch implements the same functionality in GCC.
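
As a rough illustration of the intended effect (a hedged sketch mirroring
the Clang semantics described above):

__thread int counter;

int
get_counter (void)
{
  /* With the default -mtp=el0 the TLS base is read with
       mrs x0, tpidr_el0
     while -mtp=el1 reads tpidr_el1 instead, and similarly for el2/el3.  */
  return counter;
}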

Bootstrapped and tested on aarch64-none-linux-gnu.
Confirmed with godbolt that the sequences and options are the same as what 
Clang accepts/generates.
Pushing to trunk.

Thanks,
Kyrill


[1] 
https://clang.llvm.org/docs/ClangCommandLineReference.html#cmdoption-clang-mtp

gcc/ChangeLog:

PR target/108779
* config/aarch64/aarch64-opts.h (enum aarch64_tp_reg): Define.
* config/aarch64/aarch64-protos.h (aarch64_output_load_tp):
Define prototype.
* config/aarch64/aarch64.cc (aarch64_tpidr_register): Declare.
(aarch64_override_options_internal): Handle the above.
(aarch64_output_load_tp): New function.
* config/aarch64/aarch64.md (aarch64_load_tp_hard): Call
aarch64_output_load_tp.
* config/aarch64/aarch64.opt (aarch64_tp_reg): Define enum.
(mtp=): New option.
* doc/invoke.texi (AArch64 Options): Document -mtp=.

gcc/testsuite/ChangeLog:

PR target/108779
* gcc.target/aarch64/mtp.c: New test.
* gcc.target/aarch64/mtp_1.c: New test.
* gcc.target/aarch64/mtp_2.c: New test.
* gcc.target/aarch64/mtp_3.c: New test.
* gcc.target/aarch64/mtp_4.c: New test.


mtp.patch
Description: mtp.patch


[PATCH][committed] aarch64: PR target/99195 Add scheme to optimise away vec_concat with zeroes on 64-bit Advanced SIMD ops

2023-04-21 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

I finally got around to trying out the define_subst approach for PR 
target/99195.
The problem we have is that many Advanced SIMD instructions have 64-bit vector 
variants that
clear the top half of the 128-bit Q register. This would allow the compiler to 
avoid generating
explicit zeroing instructions to concat the 64-bit result with zeroes for code 
like:
vcombine_u16(vadd_u16(a, b), vdup_n_u16(0))
We've been getting user reports of GCC missing this optimisation in real world 
code, so it's worth
doing something about it.
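
Spelled out as a compilable unit (a sketch built around the intrinsic
sequence quoted above):

#include <arm_neon.h>

uint16x8_t
f (uint16x4_t a, uint16x4_t b)
{
  /* The 64-bit add already clears the top half of the Q register, so no
     explicit zeroing instruction should be needed for the vec_concat.  */
  return vcombine_u16 (vadd_u16 (a, b), vdup_n_u16 (0));
}
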
The straightforward approach that we've been taking so far is adding extra 
patterns in aarch64-simd.md
that match the 64-bit result in a vec_concat with zeroes. Unfortunately for 
big-endian the vec_concat
operands to match must be the other way around, so we would end up adding two 
extra define_insns.
This would lead to too much bloat in aarch64-simd.md.

This patch defines a pair of define_subst constructs that allow us to annotate
patterns in aarch64-simd.md with the <vczle> and <vczbe> subst_attrs, and the
compiler will automatically produce the vec_concat widening patterns, properly
gated for BYTES_BIG_ENDIAN when needed.  This seems like the least intrusive
way to describe the extra zeroing semantics.

I've had a look at the generated insn-*.cc files in the build directory and it 
seems that define_subst does what we want it to do
when applied multiple times on a pattern in terms of insn conditions and modes.

This patch adds the define_subst machinery and adds the annotations to some
of the straightforward binary and unary integer operations.  Many more such
annotations are possible, and I aim to add them in future patches.

Bootstrapped and tested on aarch64-none-linux-gnu and on aarch64_be-none-elf.

Pushing to trunk.
Thanks,
Kyrill

gcc/ChangeLog:

PR target/99195
* config/aarch64/aarch64-simd.md (add_vec_concat_subst_le): Define.
(add_vec_concat_subst_be): Likewise.
(vczle): Likewise.
(vczbe): Likewise.
(add<mode>3): Rename to...
(add<mode>3<vczle><vczbe>): ... This.
(sub<mode>3): Rename to...
(sub<mode>3<vczle><vczbe>): ... This.
(mul<mode>3): Rename to...
(mul<mode>3<vczle><vczbe>): ... This.
(and<mode>3): Rename to...
(and<mode>3<vczle><vczbe>): ... This.
(ior<mode>3): Rename to...
(ior<mode>3<vczle><vczbe>): ... This.
(xor<mode>3): Rename to...
(xor<mode>3<vczle><vczbe>): ... This.
* config/aarch64/iterators.md (VDZ): Define.

gcc/testsuite/ChangeLog:

PR target/99195
* gcc.target/aarch64/simd/pr99195_1.c: New test.


vcz.patch
Description: vcz.patch


[PATCH] gcc/m2: Drop references to $(P)

2023-04-21 Thread Arsen Arsenović via Gcc-patches
$(P) seems to have been a workaround for some old, proprietary make
implementations that we no longer support.  It was removed in
r0-31149-gb8dad04b688e9c.

gcc/m2/ChangeLog:

* Make-lang.in: Remove references to $(P).
* Make-maintainer.in: Ditto.
---
Hi,

We spotted that the m2 makefile includes some long-gone compatibility
variable $(P), presumably left over from when m2 was not in the tree
yet.  This induced a build failure on our end:
https://bugs.gentoo.org/904714

Build-tested on x86_64-pc-linux-gnu.  I haven't finished running the
testsuite.  I believe this only ever expands to an empty string (if not
set by the env) in the current build system, so in theory, it should be
safe.

OK for gcc-13 and trunk (with a priority on the former)?

 gcc/m2/Make-lang.in   | 4 ++--
 gcc/m2/Make-maintainer.in | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/gcc/m2/Make-lang.in b/gcc/m2/Make-lang.in
index b34db0d9156..d0fa692e5b8 100644
--- a/gcc/m2/Make-lang.in
+++ b/gcc/m2/Make-lang.in
@@ -514,7 +514,7 @@ GM2_LIBS_BOOT = m2/gm2-compiler-boot/gm2.a \
 cc1gm2$(exeext): m2/stage1/cc1gm2$(exeext) $(m2.prev)
cp -p $< $@
 
-m2/stage2/cc1gm2$(exeext): m2/stage1/cc1gm2$(exeext) m2/gm2-compiler/m2flex.o 
$(P) \
+m2/stage2/cc1gm2$(exeext): m2/stage1/cc1gm2$(exeext) m2/gm2-compiler/m2flex.o \
 $(GM2_C_OBJS) $(BACKEND) $(LIBDEPS) $(GM2_LIBS) \
 m2/gm2-gcc/rtegraph.o plugin/m2rte$(soext)
-test -d $(@D) || $(mkinstalldirs) $(@D)
@@ -527,7 +527,7 @@ m2/stage2/cc1gm2$(exeext): m2/stage1/cc1gm2$(exeext) 
m2/gm2-compiler/m2flex.o $(
@$(call LINK_PROGRESS,$(INDEX.m2),end)
 
 m2/stage1/cc1gm2$(exeext): gm2$(exeext) m2/gm2-compiler-boot/m2flex.o \
-$(P) $(GM2_C_OBJS) $(BACKEND) $(LIBDEPS) \
+$(GM2_C_OBJS) $(BACKEND) $(LIBDEPS) \
 $(GM2_LIBS_BOOT) $(MC_LIBS) \
 m2/gm2-gcc/rtegraph.o plugin/m2rte$(soext) \
 $(m2.prev)
diff --git a/gcc/m2/Make-maintainer.in b/gcc/m2/Make-maintainer.in
index 17bd9a2d37e..a70682673cd 100644
--- a/gcc/m2/Make-maintainer.in
+++ b/gcc/m2/Make-maintainer.in
@@ -753,7 +753,7 @@ GM2_LIBS_PARANOID = m2/gm2-compiler-paranoid/gm2.a \
 gm2.paranoid: m2/m2obj3/cc1gm2$(exeext) gm2.verifyparanoid
 
 m2/m2obj3/cc1gm2$(exeext): m2/m2obj2/cc1gm2$(exeext) 
m2/gm2-compiler-paranoid/m2flex.o \
-$(P) $(GM2_C_OBJS) $(BACKEND) $(LIBDEPS) 
$(GM2_LIBS_PARANOID) \
+$(GM2_C_OBJS) $(BACKEND) $(LIBDEPS) 
$(GM2_LIBS_PARANOID) \
 m2/gm2-gcc/rtegraph.o plugin/m2rte$(exeext).so 
m2/gm2-libs-boot/M2LINK.o
-test -d m2/m2obj3 || $(mkinstalldirs) m2/m2obj3
@$(call LINK_PROGRESS,$(INDEX.m2),start)
-- 
2.40.0



Re: [PATCH] RFC: New compact syntax for insn and insn_split in Machine Descriptions

2023-04-21 Thread Richard Sandiford via Gcc-patches
Tamar Christina  writes:
> Hi All,
>
> This patch adds support for a compact syntax for specifying constraints in
> instruction patterns. Credit for the idea goes to Richard Earnshaw.
>
> I am sending up this RFC to get feedback for its inclusion in GCC 14.
> With this new syntax we want a clean break from the current limitations to 
> make
> something that is hopefully easier to use and maintain.
>
> The idea behind this compact syntax is that it is often quite hard to
> correlate the entries in the constraints list, the attributes and the
> instruction lists.
>
> One has to count the entries, and this is often tedious.  Additionally, when
> changing a single line in the insn, multiple lines in a diff change, making
> it harder to see what's going on.
>
> This new syntax takes into account many of the common things that are done
> in MD files.  It's also worth saying that this version is intended to deal
> with the common case of string-based alternatives.  For C chunks we have
> some ideas, but those are not intended to be addressed here.
>
> It's easiest to explain with an example:
>
> normal syntax:
>
> (define_insn_and_split "*movsi_aarch64"
>   [(set (match_operand:SI 0 "nonimmediate_operand" "=r,k,r,r,r,r, r,w, m, m,  r,  r,  r, w,r,w, w")
>   (match_operand:SI 1 "aarch64_mov_operand"  " r,r,k,M,n,Usv,m,m,rZ,w,Usw,Usa,Ush,rZ,w,w,Ds"))]
>   "(register_operand (operands[0], SImode)
> || aarch64_reg_or_zero (operands[1], SImode))"
>   "@
>mov\\t%w0, %w1
>mov\\t%w0, %w1
>mov\\t%w0, %w1
>mov\\t%w0, %1
>#
>* return aarch64_output_sve_cnt_immediate (\"cnt\", \"%x0\", operands[1]);
>ldr\\t%w0, %1
>ldr\\t%s0, %1
>str\\t%w1, %0
>str\\t%s1, %0
>adrp\\t%x0, %A1\;ldr\\t%w0, [%x0, %L1]
>adr\\t%x0, %c1
>adrp\\t%x0, %A1
>fmov\\t%s0, %w1
>fmov\\t%w0, %s1
>fmov\\t%s0, %s1
>* return aarch64_output_scalar_simd_mov_immediate (operands[1], SImode);"
>   "CONST_INT_P (operands[1]) && !aarch64_move_imm (INTVAL (operands[1]), 
> SImode)
> && REG_P (operands[0]) && GP_REGNUM_P (REGNO (operands[0]))"
>[(const_int 0)]
>"{
>aarch64_expand_mov_immediate (operands[0], operands[1]);
>DONE;
> }"
>   ;; The "mov_imm" type for CNT is just a placeholder.
>   [(set_attr "type" "mov_reg,mov_reg,mov_reg,mov_imm,mov_imm,mov_imm,load_4,load_4,store_4,store_4,load_4,adr,adr,f_mcr,f_mrc,fmov,neon_move")
>(set_attr "arch"   "*,*,*,*,*,sve,*,fp,*,fp,*,*,*,fp,fp,fp,simd")
>(set_attr "length" "4,4,4,4,*,  4,4, 4,4, 4,8,4,4, 4, 4, 4,   4")
> ]
> )
>
> New syntax:
>
> (define_insn_and_split "*movsi_aarch64"
>   [(set (match_operand:SI 0 "nonimmediate_operand")
>   (match_operand:SI 1 "aarch64_mov_operand"))]
>   "(register_operand (operands[0], SImode)
> || aarch64_reg_or_zero (operands[1], SImode))"
>   "@@ (cons: 0 1; attrs: type arch length)
>[=r, r  ; mov_reg  , *   , 4] mov\t%w0, %w1
>[k , r  ; mov_reg  , *   , 4] ^
>[r , k  ; mov_reg  , *   , 4] ^
>[r , M  ; mov_imm  , *   , 4] mov\t%w0, %1
>[r , n  ; mov_imm  , *   , *] #
>[r , Usv; mov_imm  , sve , 4] << aarch64_output_sve_cnt_immediate ('cnt', '%x0', operands[1]);
>[r , m  ; load_4   , *   , 4] ldr\t%w0, %1
>[w , m  ; load_4   , fp  , 4] ldr\t%s0, %1
>[m , rZ ; store_4  , *   , 4] str\t%w1, %0
>[m , w  ; store_4  , fp  , 4] str\t%s1, %0
>[r , Usw; load_4   , *   , 8] adrp\t%x0, %A1;ldr\t%w0, [%x0, %L1]
>[r , Usa; adr  , *   , 4] adr\t%x0, %c1
>[r , Ush; adr  , *   , 4] adrp\t%x0, %A1
>[w , rZ ; f_mcr, fp  , 4] fmov\t%s0, %w1
>[r , w  ; f_mrc, fp  , 4] fmov\t%w0, %s1
>[w , w  ; fmov , fp  , 4] fmov\t%s0, %s1
>[w , Ds ; neon_move, simd, 4] << aarch64_output_scalar_simd_mov_immediate (operands[1], SImode);"
>   "CONST_INT_P (operands[1]) && !aarch64_move_imm (INTVAL (operands[1]), 
> SImode)
> && REG_P (operands[0]) && GP_REGNUM_P (REGNO (operands[0]))"
>   [(const_int 0)]
>   {
> aarch64_expand_mov_immediate (operands[0], operands[1]);
> DONE;
>   }
>   ;; The "mov_imm" type for CNT is just a placeholder.
> )
>
> The patch contains some more rewritten examples for both Arm and AArch64.  I
> have included them as examples in this RFC, but the final version posted for
> GCC 14 will have these split out.
>
> The main syntax rules are as follows (See docs for full rules):
>   - Template must start with "@@" to use the new syntax.
>   - "@@" is followed by a layout in parentheses which is "cons:" followed by
> a list of match_operand/match_scratch IDs, then a semicolon, then the
> same for attributes ("attrs:"). Both sections are optional (so you can
> use only cons, or only attrs, or both), and cons must come before attrs
> if present.
>   - Each alternative begins with any amount of whitespace.
>   - Following the whitespace is a comma-separated list of constraints and/or
> attributes within brackets [], with sections 

Re: [PATCH] tree, c++: declare some basic functions inline

2023-04-21 Thread Patrick Palka via Gcc-patches
On Sun, 4 Dec 2022, Patrick Palka wrote:

> The functions strip_array_types, is_typedef_decl, typedef_variant_p,
> cp_type_quals and cp_expr_location are used throughout the C++ frontend
> including in some fairly hot parts (e.g. in the tsubst routines and
> cp_walk_subtree) and they're small enough that the overhead of calling
> them out-of-line is relatively significant.
> 
> This patch moves their definitions into the appropriate headers to
> enable inlining them.  This speeds up the C++ frontend by ~1% according
> to my experiments.  In passing this also downgrades the assert in
> cp_type_quals to a checking assert.
> 
> Bootstrapped and regtested on x86_64-pc-linux-gnu, does this look OK
> for trunk stage3 or perhaps for stage1?

Ping.

> 
> gcc/cp/ChangeLog:
> 
>   * cp-tree.h (cp_type_quals): Define here.  Downgrade assert into
>   a checking assert.
>   (cp_expr_location): Define here.
>   * tree.cc (cp_expr_location): Don't define here.
>   * typeck.cc (cp_type_quals): Don't define here.
> 
> gcc/ChangeLog:
> 
>   * tree.cc (strip_array_types): Don't define here.
>   (is_typedef_decl): Don't define here.
>   (typedef_variant_p): Don't define here.
>   * tree.h (strip_array_types): Define here.
>   (is_typedef_decl): Define here.
>   (typedef_variant_p): Define here.
> ---
>  gcc/cp/cp-tree.h | 50 ++--
>  gcc/cp/tree.cc   | 24 ---
>  gcc/cp/typeck.cc | 23 --
>  gcc/tree.cc  | 29 
>  gcc/tree.h   | 32 ---
>  5 files changed, 77 insertions(+), 81 deletions(-)
> 
> diff --git a/gcc/cp/cp-tree.h b/gcc/cp/cp-tree.h
> index addd26ea077..19914d08a03 100644
> --- a/gcc/cp/cp-tree.h
> +++ b/gcc/cp/cp-tree.h
> @@ -49,7 +49,7 @@ c-common.h, not after.
> but not all node kinds do (e.g. constants, and references to
> params, locals, etc), so we stash a copy here.  */
>  
> -extern location_t cp_expr_location   (const_tree);
> +inline location_t cp_expr_location   (const_tree);
>  
>  class cp_expr
>  {
> @@ -8100,7 +8100,6 @@ extern bool error_type_p (const_tree);
>  extern bool ptr_reasonably_similar   (const_tree, const_tree);
>  extern tree build_ptrmemfunc (tree, tree, int, bool,
>tsubst_flags_t);
> -extern int cp_type_quals (const_tree);
>  extern int type_memfn_quals  (const_tree);
>  extern cp_ref_qualifier type_memfn_rqual (const_tree);
>  extern tree apply_memfn_quals(tree, cp_cv_quals,
> @@ -8151,6 +8150,29 @@ extern void maybe_warn_about_useless_cast (location_t, tree, tree,
>tsubst_flags_t);
>  extern tree cp_perform_integral_promotions  (tree, tsubst_flags_t);
>  
> +/* Returns the type qualifiers for this type, including the qualifiers on the
> +   elements for an array type.  */
> +
> +inline int
> +cp_type_quals (const_tree type)
> +{
> +  int quals;
> +  /* This CONST_CAST is okay because strip_array_types returns its
> + argument unmodified and we assign it to a const_tree.  */
> +  type = strip_array_types (CONST_CAST_TREE (type));
> +  if (type == error_mark_node
> +  /* Quals on a FUNCTION_TYPE are memfn quals.  */
> +  || TREE_CODE (type) == FUNCTION_TYPE)
> +return TYPE_UNQUALIFIED;
> +  quals = TYPE_QUALS (type);
> +  /* METHOD and REFERENCE_TYPEs should never have quals.  */
> +  gcc_checking_assert ((TREE_CODE (type) != METHOD_TYPE
> + && !TYPE_REF_P (type))
> +|| ((quals & (TYPE_QUAL_CONST|TYPE_QUAL_VOLATILE))
> +== TYPE_UNQUALIFIED));
> +  return quals;
> +}
> +
>  extern tree finish_left_unary_fold_expr  (tree, int);
>  extern tree finish_right_unary_fold_expr (tree, int);
>  extern tree finish_binary_fold_expr  (tree, tree, int);
> @@ -8168,6 +8190,30 @@ loc_or_input_loc (location_t loc)
>return loc == UNKNOWN_LOCATION ? input_location : loc;
>  }
>  
> +/* Like EXPR_LOCATION, but also handle some tcc_exceptional that have
> +   locations.  */
> +
> +inline location_t
> +cp_expr_location (const_tree t_)
> +{
> +  tree t = CONST_CAST_TREE (t_);
> +  if (t == NULL_TREE)
> +return UNKNOWN_LOCATION;
> +  switch (TREE_CODE (t))
> +{
> +case LAMBDA_EXPR:
> +  return LAMBDA_EXPR_LOCATION (t);
> +case STATIC_ASSERT:
> +  return STATIC_ASSERT_SOURCE_LOCATION (t);
> +case TRAIT_EXPR:
> +  return TRAIT_EXPR_LOCATION (t);
> +case PTRMEM_CST:
> +  return PTRMEM_CST_LOCATION (t);
> +default:
> +  return EXPR_LOCATION (t);
> +}
> +}
> +
>  inline location_t
>  cp_expr_loc_or_loc (const_tree t, location_t or_loc)
>  {
> diff --git a/gcc/cp/tree.cc b/gcc/cp/tree.cc
> index 1487f4975c5..4066b014f6e 100644
> --- 

Re: [PATCH] tree, c++: optimize walk_tree_1 and cp_walk_subtrees

2023-04-21 Thread Patrick Palka via Gcc-patches
On Mon, 5 Dec 2022, Jason Merrill wrote:

> On 12/5/22 06:09, Prathamesh Kulkarni wrote:
> > On Mon, 5 Dec 2022 at 09:51, Patrick Palka via Gcc-patches
> >  wrote:
> > > 
> > > These functions currently repeatedly dereference tp during the subtree
> > > walk, dereferences which the compiler can't CSE because it can't
> > > guarantee that the subtree walking doesn't modify *tp.
> > > 
> > > But we already implicitly require that TREE_CODE (*tp) remains the same
> > > throughout the subtree walks, so it doesn't seem to be a huge leap to
> > > strengthen that to requiring *tp remains the same.
> > Hi Patrick,
> > Just wondering in that case, if marking tp with const_tree *, instead
> > of tree *, would perhaps help the compiler
> > for CSEing some of the dereferences to *tp ?
> 
> That wouldn't make a difference; even if *tp is const, the compiler can't be
> sure that a call won't change it.  And const_tree * doesn't even make *tp
> const, it makes **tp const.
> 
> > > So this patch manually CSEs the dereferences of *tp.  This means that
> > > the callback function can no longer replace *tp with another tree (of
> > > the same TREE_CODE) when walking one of its subtrees, but that doesn't
> > > sound like a useful feature anyway.  This speeds up the C++ frontend by
> > > about ~1.5% according to my experiments.
> > > 
> > > Bootstrapped and regtested on x86_64-pc-linux-gnu, does this look OK for
> > > trunk stage3 or perhaps for stage1?
> 
> OK for stage 1.

Thanks, I just pushed this along with a drive-by change to
cp_walk_subtrees to use cp_unevaluated instead
of incrementing cp_unevaluated_operand directly, which allows us to
safely use WALK_SUBTREE in those cases:

-- >8 --

Subject: [PATCH] c++, tree: optimize walk_tree_1 and cp_walk_subtrees

gcc/cp/ChangeLog:

* tree.cc (cp_walk_subtrees): Avoid repeatedly dereferencing tp.
<case DECLTYPE_TYPE>: Use cp_unevaluated and WALK_SUBTREE.
<case REQUIRES_EXPR>: Likewise.

gcc/ChangeLog:

* tree.cc (walk_tree_1): Avoid repeatedly dereferencing tp
and type_p.
---
 gcc/cp/tree.cc | 134 +
 gcc/tree.cc| 103 +++--
 2 files changed, 119 insertions(+), 118 deletions(-)

diff --git a/gcc/cp/tree.cc b/gcc/cp/tree.cc
index 69852538894..d35e30faf28 100644
--- a/gcc/cp/tree.cc
+++ b/gcc/cp/tree.cc
@@ -5445,7 +5445,8 @@ tree
 cp_walk_subtrees (tree *tp, int *walk_subtrees_p, walk_tree_fn func,
  void *data, hash_set *pset)
 {
-  enum tree_code code = TREE_CODE (*tp);
+  tree t = *tp;
+  enum tree_code code = TREE_CODE (t);
   tree result;
 
 #define WALK_SUBTREE(NODE) \
@@ -5456,7 +5457,7 @@ cp_walk_subtrees (tree *tp, int *walk_subtrees_p, walk_tree_fn func,
 }  \
   while (0)
 
-  if (TYPE_P (*tp))
+  if (TYPE_P (t))
 {
   /* If *WALK_SUBTREES_P is 1, we're interested in the syntactic form of
 the argument, so don't look through typedefs, but do walk into
@@ -5468,15 +5469,15 @@ cp_walk_subtrees (tree *tp, int *walk_subtrees_p, walk_tree_fn func,
 
 See find_abi_tags_r for an example of setting *WALK_SUBTREES_P to 2
 when that's the behavior the walk_tree_fn wants.  */
-  if (*walk_subtrees_p == 1 && typedef_variant_p (*tp))
+  if (*walk_subtrees_p == 1 && typedef_variant_p (t))
{
- if (tree ti = TYPE_ALIAS_TEMPLATE_INFO (*tp))
+ if (tree ti = TYPE_ALIAS_TEMPLATE_INFO (t))
WALK_SUBTREE (TI_ARGS (ti));
  *walk_subtrees_p = 0;
  return NULL_TREE;
}
 
-  if (tree ti = TYPE_TEMPLATE_INFO (*tp))
+  if (tree ti = TYPE_TEMPLATE_INFO (t))
WALK_SUBTREE (TI_ARGS (ti));
 }
 
@@ -5486,8 +5487,8 @@ cp_walk_subtrees (tree *tp, int *walk_subtrees_p, walk_tree_fn func,
   switch (code)
 {
 case TEMPLATE_TYPE_PARM:
-  if (template_placeholder_p (*tp))
-   WALK_SUBTREE (CLASS_PLACEHOLDER_TEMPLATE (*tp));
+  if (template_placeholder_p (t))
+   WALK_SUBTREE (CLASS_PLACEHOLDER_TEMPLATE (t));
   /* Fall through.  */
 case DEFERRED_PARSE:
 case TEMPLATE_TEMPLATE_PARM:
@@ -5501,63 +5502,63 @@ cp_walk_subtrees (tree *tp, int *walk_subtrees_p, walk_tree_fn func,
   break;
 
 case TYPENAME_TYPE:
-  WALK_SUBTREE (TYPE_CONTEXT (*tp));
-  WALK_SUBTREE (TYPENAME_TYPE_FULLNAME (*tp));
+  WALK_SUBTREE (TYPE_CONTEXT (t));
+  WALK_SUBTREE (TYPENAME_TYPE_FULLNAME (t));
   *walk_subtrees_p = 0;
   break;
 
 case BASELINK:
-  if (BASELINK_QUALIFIED_P (*tp))
-   WALK_SUBTREE (BINFO_TYPE (BASELINK_ACCESS_BINFO (*tp)));
-  WALK_SUBTREE (BASELINK_FUNCTIONS (*tp));
+  if (BASELINK_QUALIFIED_P (t))
+   WALK_SUBTREE (BINFO_TYPE (BASELINK_ACCESS_BINFO (t)));
+  WALK_SUBTREE (BASELINK_FUNCTIONS (t));
   *walk_subtrees_p = 0;
   break;
 
 case PTRMEM_CST:
-  WALK_SUBTREE (TREE_TYPE (*tp));
+  WALK_SUBTREE 

Re: [match.pd] [SVE] Add pattern to transform svrev(svrev(v)) --> v

2023-04-21 Thread Richard Sandiford via Gcc-patches
Prathamesh Kulkarni  writes:
> On Wed, 19 Apr 2023 at 16:17, Richard Biener  
> wrote:
>>
>> On Wed, Apr 19, 2023 at 11:21 AM Prathamesh Kulkarni
>>  wrote:
>> >
>> > On Tue, 11 Apr 2023 at 19:36, Prathamesh Kulkarni
>> >  wrote:
>> > >
>> > > On Tue, 11 Apr 2023 at 14:17, Richard Biener 
>> > >  wrote:
>> > > >
>> > > > On Wed, Apr 5, 2023 at 10:39 AM Prathamesh Kulkarni via Gcc-patches
>> > > >  wrote:
>> > > > >
>> > > > > Hi,
>> > > > > For the following test:
>> > > > >
>> > > > > svint32_t f(svint32_t v)
>> > > > > {
>> > > > >   return svrev_s32 (svrev_s32 (v));
>> > > > > }
>> > > > >
>> > > > > We generate 2 rev instructions instead of nop:
>> > > > > f:
>> > > > > rev z0.s, z0.s
>> > > > > rev z0.s, z0.s
>> > > > > ret
>> > > > >
>> > > > > The attached patch tries to fix that by trying to recognize the 
>> > > > > following
>> > > > > pattern in match.pd:
>> > > > > v1 = VEC_PERM_EXPR (v0, v0, mask)
>> > > > > v2 = VEC_PERM_EXPR (v1, v1, mask)
>> > > > > -->
>> > > > > v2 = v0
>> > > > > if mask is { nelts - 1, nelts - 2, nelts - 3, ... }
>> > > > >
>> > > > > Code-gen with patch:
>> > > > > f:
>> > > > > ret
>> > > > >
>> > > > > Bootstrap+test passes on aarch64-linux-gnu, and SVE bootstrap in 
>> > > > > progress.
>> > > > > Does it look OK for stage-1 ?
>> > > >
>> > > > I didn't look at the patch but 
>> > > > tree-ssa-forwprop.cc:simplify_permutation should
>> > > > handle two consecutive permutes with the 
>> > > > is_combined_permutation_identity
>> > > > which might need tweaking for VLA vectors
>> > > Hi Richard,
>> > > Thanks for the suggestions. The attached patch modifies
>> > > is_combined_permutation_identity
>> > > to recognize the above pattern.
>> > > Does it look OK ?
>> > > Bootstrap+test in progress on aarch64-linux-gnu and x86_64-linux-gnu.
>> > Hi,
>> > ping https://gcc.gnu.org/pipermail/gcc-patches/2023-April/615502.html
>>
>> Can you instead of def_stmt pass in a bool whether rhs1 is equal to rhs2
>> and amend the function comment accordingly, say,
>>
>>   tem = VEC_PERM <op0, op1, mask1>;
>>   res = VEC_PERM <tem, tem, mask2>;
>>
>> SAME_P specifies whether op0 and op1 compare equal.  */
>>
>> +  if (def_stmt)
>> +gcc_checking_assert (is_gimple_assign (def_stmt)
>> +&& gimple_assign_rhs_code (def_stmt) == 
>> VEC_PERM_EXPR);
>> this is then unnecessary
>>
>>mask = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (mask1), mask1, mask1, 
>> mask2);
>> +
>> +  /* For VLA masks, check for the following pattern:
>> + v1 = VEC_PERM_EXPR (v0, v0, mask)
>> + v2 = VEC_PERM_EXPR (v1, v1, mask)
>> + -->
>> + v2 = v0
>>
>> you are not using 'mask' so please defer fold_ternary until after your
>> special-case.
>>
>> +  if (operand_equal_p (mask1, mask2, 0)
>> +  && !VECTOR_CST_NELTS (mask1).is_constant ()
>> +  && def_stmt
>> +  && operand_equal_p (gimple_assign_rhs1 (def_stmt),
>> + gimple_assign_rhs2 (def_stmt), 0))
>> +{
>> +  vec_perm_builder builder;
>> +  if (tree_to_vec_perm_builder (, mask1))
>> +   {
>> + poly_uint64 nelts = TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask1));
>> + vec_perm_indices sel (builder, 1, nelts);
>> + if (sel.series_p (0, 1, nelts - 1, -1))
>> +   return 1;
>> +   }
>> +  return 0;
>>
>> I'm deferring to Richard on whether this is the correct way to check for a
>> vector-reversing mask (I wonder how constructing such a mask is even possible)
> Hi Richard,
> Thanks for the suggestions, I have updated the patch accordingly.
>
> The following hunk from svrev_impl::fold() constructs mask in reverse:
> /* Permute as { nelts - 1, nelts - 2, nelts - 3, ... }.  */
> poly_int64 nelts = TYPE_VECTOR_SUBPARTS (TREE_TYPE (f.lhs));
> vec_perm_builder builder (nelts, 1, 3);
> for (int i = 0; i < 3; ++i)
>   builder.quick_push (nelts - i - 1);
> return fold_permute (f, builder);
>
> To see if mask chooses elements in reverse, I borrowed it from function 
> comment
> for series_p in vec-perm-indices.cc:
> /* Return true if index OUT_BASE + I * OUT_STEP selects input
>element IN_BASE + I * IN_STEP.  For example, the call to test
>whether a permute reverses a vector of N elements would be:
>
>  series_p (0, 1, N - 1, -1)
>
>which would return true for { N - 1, N - 2, N - 3, ... }.  */
>
> Thanks,
> Prathamesh
>>
>> Richard.
>>
>> > Thanks,
>> > Prathamesh
>> > >
>> > > Thanks,
>> > > Prathamesh
>> > > >
>> > > > Richard.
>> > > >
>> > > > >
>> > > > > Thanks,
>> > > > > Prathamesh
>
> gcc/ChangeLog:
>   * tree-ssa-forwprop.cc (is_combined_permutation_identity):
>   New parameter same_p.
>   Try to simplify two successive VEC_PERM_EXPRs with single operand
>   and same mask, where mask chooses elements in reverse order.
>
> gcc/testsuite/ChangeLog:
>   * gcc.target/aarch64/sve/acle/general/rev-1.c: New test.
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general/rev-1.c 
> 

[committed] expansion: make layout of x_shift*cost[][][] more efficient

2023-04-21 Thread Vineet Gupta

On 4/19/23 00:05, Richard Biener wrote:

On Tue, Apr 18, 2023 at 10:51 PM Vineet Gupta  wrote:

While debugging expmed.[ch] for PR/108987, I saw that some of the cost arrays
have a less than ideal layout, as follows:

x_shift*cost[0..63][speed][modes]

We want speed to be the first index, since a typical compile has that fixed,
followed by mode and then the shift values.

It should be non-functional from the compiler-semantics point of view, except
executing slightly faster due to better locality of the shift values for a
given speed and mode.  It is also a bit more intuitive when debugging.

OK, but please wait 24h in case somebody else wants to comment.


Pushed.

Thx,
-Vineet




Thanks,
Richard.


gcc/ChangeLog:

 * expmed.h (x_shift*_cost): Convert to int [speed][mode][shift].
 (shift*_cost_ptr ()): Access x_shift*_cost array directly.

Signed-off-by: Vineet Gupta 
---
Changes since v1:
- Post a non stale version of patch
---
  gcc/expmed.h | 27 +--
  1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/gcc/expmed.h b/gcc/expmed.h
index c747a0da1637..22ae1d2d0743 100644
--- a/gcc/expmed.h
+++ b/gcc/expmed.h
@@ -161,15 +161,14 @@ struct target_expmed {
struct expmed_op_cheap x_sdiv_pow2_cheap;
struct expmed_op_cheap x_smod_pow2_cheap;

-  /* Cost of various pieces of RTL.  Note that some of these are indexed by
- shift count and some by mode.  */
+  /* Cost of various pieces of RTL.  */
int x_zero_cost[2];
struct expmed_op_costs x_add_cost;
struct expmed_op_costs x_neg_cost;
-  struct expmed_op_costs x_shift_cost[MAX_BITS_PER_WORD];
-  struct expmed_op_costs x_shiftadd_cost[MAX_BITS_PER_WORD];
-  struct expmed_op_costs x_shiftsub0_cost[MAX_BITS_PER_WORD];
-  struct expmed_op_costs x_shiftsub1_cost[MAX_BITS_PER_WORD];
+  int x_shift_cost[2][NUM_MODE_IPV_INT][MAX_BITS_PER_WORD];
+  int x_shiftadd_cost[2][NUM_MODE_IPV_INT][MAX_BITS_PER_WORD];
+  int x_shiftsub0_cost[2][NUM_MODE_IPV_INT][MAX_BITS_PER_WORD];
+  int x_shiftsub1_cost[2][NUM_MODE_IPV_INT][MAX_BITS_PER_WORD];
struct expmed_op_costs x_mul_cost;
struct expmed_op_costs x_sdiv_cost;
struct expmed_op_costs x_udiv_cost;
@@ -395,8 +394,8 @@ neg_cost (bool speed, machine_mode mode)
  inline int *
  shift_cost_ptr (bool speed, machine_mode mode, int bits)
  {
-  return expmed_op_cost_ptr (_target_expmed->x_shift_cost[bits],
-speed, mode);
+  int midx = expmed_mode_index (mode);
+  return _target_expmed->x_shift_cost[speed][midx][bits];
  }

  /* Set the COST of doing a shift in MODE by BITS when optimizing for SPEED.  */
@@ -421,8 +420,8 @@ shift_cost (bool speed, machine_mode mode, int bits)
  inline int *
  shiftadd_cost_ptr (bool speed, machine_mode mode, int bits)
  {
-  return expmed_op_cost_ptr (_target_expmed->x_shiftadd_cost[bits],
-speed, mode);
+  int midx = expmed_mode_index (mode);
+  return _target_expmed->x_shiftadd_cost[speed][midx][bits];
  }

  /* Set the COST of doing a shift in MODE by BITS followed by an add when
@@ -448,8 +447,8 @@ shiftadd_cost (bool speed, machine_mode mode, int bits)
  inline int *
  shiftsub0_cost_ptr (bool speed, machine_mode mode, int bits)
  {
-  return expmed_op_cost_ptr (_target_expmed->x_shiftsub0_cost[bits],
-speed, mode);
+  int midx = expmed_mode_index (mode);
+  return _target_expmed->x_shiftsub0_cost[speed][midx][bits];
  }

  /* Set the COST of doing a shift in MODE by BITS and then subtracting a
@@ -475,8 +474,8 @@ shiftsub0_cost (bool speed, machine_mode mode, int bits)
  inline int *
  shiftsub1_cost_ptr (bool speed, machine_mode mode, int bits)
  {
-  return expmed_op_cost_ptr (_target_expmed->x_shiftsub1_cost[bits],
-speed, mode);
+  int midx = expmed_mode_index (mode);
+  return _target_expmed->x_shiftsub1_cost[speed][midx][bits];
  }

  /* Set the COST of subtracting a shift in MODE by BITS from a value when
--
2.34.1





[committed] MAINTAINERS: add Vineet Gupta to write after approval

2023-04-21 Thread Vineet Gupta

On 4/21/23 09:03, Kito Cheng wrote:

You need to use the git+ssh protocol; I use this way to manage that:

git remote add upstream-write git+ssh://<username>@gcc.gnu.org/git/gcc.git
git push upstream-write master


Thx Kito. That worked. I'll try to update the wiki.

-Vineet


On Sat, Apr 22, 2023 at 12:00 AM Vineet Gupta  wrote:


On 4/21/23 02:30, Richard Sandiford wrote:

No approval is needed when adding oneself to write-after-approval.
The fact that one's able to make the change is proof enough.

Thx Richard.

Noob question: I tried to commit/push but failed.

| $ git remote show upstream
| * remote upstream
|  Fetch URL: git://gcc.gnu.org/git/gcc.git
|
| $ git push upstream upstream-exp:master
| fatal: remote error: service not enabled: /git/gcc.git

Reading thru [1] it seems commit on trunk is only possible via git svn.
Is that true ?

Thx,
-Vineet

[1] https://gcc.gnu.org/wiki/GitMirror




Re: [PATCH] MAINTAINERS: add Vineet Gupta to write after approval

2023-04-21 Thread Kito Cheng via Gcc-patches
You need to use the git+ssh protocol; I use this way to manage that:

git remote add upstream-write git+ssh://<username>@gcc.gnu.org/git/gcc.git
git push upstream-write master


On Sat, Apr 22, 2023 at 12:00 AM Vineet Gupta  wrote:
>
>
> On 4/21/23 02:30, Richard Sandiford wrote:
> > No approval is needed when adding oneself to write-after-approval.
> > The fact that one's able to make the change is proof enough.
>
> Thx Richard.
>
> Noob question: I tried to commit/push but failed.
>
> | $ git remote show upstream
> | * remote upstream
> |  Fetch URL: git://gcc.gnu.org/git/gcc.git
> |
> | $ git push upstream upstream-exp:master
> | fatal: remote error: service not enabled: /git/gcc.git
>
> Reading thru [1] it seems commit on trunk is only possible via git svn.
> Is that true ?
>
> Thx,
> -Vineet
>
> [1] https://gcc.gnu.org/wiki/GitMirror


Re: [PATCH] MAINTAINERS: add Vineet Gupta to write after approval

2023-04-21 Thread Vineet Gupta



On 4/21/23 02:30, Richard Sandiford wrote:

No approval is needed when adding oneself to write-after-approval.
The fact that one's able to make the change is proof enough.


Thx Richard.

Noob question: I tried to commit/push but failed.

| $ git remote show upstream
| * remote upstream
|  Fetch URL: git://gcc.gnu.org/git/gcc.git
|
| $ git push upstream upstream-exp:master
| fatal: remote error: service not enabled: /git/gcc.git

Reading thru [1] it seems commit on trunk is only possible via git svn.
Is that true ?

Thx,
-Vineet

[1] https://gcc.gnu.org/wiki/GitMirror


Re: [PATCH v2] Leveraging the use of STP instruction for vec_duplicate

2023-04-21 Thread Richard Sandiford via Gcc-patches
"Victor L. Do Nascimento"  writes:
> The backend pattern for storing a pair of identical values in 32 and
> 64-bit modes with the machine instruction STP was missing, and
> multiple instructions were needed to reproduce this behavior as a
> result of a failed RTL pattern match in the combine pass.
>
> For the test case:
>
> typedef long long v2di __attribute__((vector_size (16)));
> typedef int v2si __attribute__((vector_size (8)));
>
> void
> foo (v2di *x, long long a)
> {
>   v2di tmp = {a, a};
>   *x = tmp;
> }
>
> void
> foo2 (v2si *x, int a)
> {
>   v2si tmp = {a, a};
>   *x = tmp;
> }
>
> at -O2 on aarch64 gives:
>
> foo
>   stp x1, x1, [x0]
>   ret
> foo2:
>   stp w1, w1, [x0]
>   ret
>
> instead of:
>
> foo:
>   dup v0.2d, x1
>   str q0, [x0]
>   ret
> foo2:
>   dup v0.2s, w1
>   str d0, [x0]
>   ret
>
> Bootstrapped and regtested on aarch64-none-linux-gnu.  Ok to install?
>
> gcc/
>   * config/aarch64/aarch64-simd.md (aarch64_simd_stp<mode>): New.
>   * config/aarch64/constraints.md: Make "Umn" a relaxed memory
>   constraint.
>   * config/aarch64/iterators.md (ldpstp_vel_sz): New.
>
> gcc/testsuite/
>   * gcc.target/aarch64/stp_vec_dup_32_64-1.c:

Nit: missing text after ":"

OK to install with that fixed, thanks.  Please follow
https://gcc.gnu.org/gitwrite.html to get write access.

Richard

> ---
>  gcc/config/aarch64/aarch64-simd.md| 10 
>  gcc/config/aarch64/constraints.md |  2 +-
>  gcc/config/aarch64/iterators.md   |  3 +
>  .../gcc.target/aarch64/stp_vec_dup_32_64-1.c  | 57 +++
>  4 files changed, 71 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/stp_vec_dup_32_64-1.c
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index de2b7383749..8b5e67bd100 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -229,6 +229,16 @@
>[(set_attr "type" "neon_stp")]
>  )
>  
> +(define_insn "aarch64_simd_stp<mode>"
> +  [(set (match_operand:VP_2E 0 "aarch64_mem_pair_lanes_operand" "=Umn,Umn")
> +	(vec_duplicate:VP_2E (match_operand:<VEL> 1 "register_operand" "w,r")))]
> +  "TARGET_SIMD"
> +  "@
> +   stp\\t%1, %1, %y0
> +   stp\\t%1, %1, %y0"
> +  [(set_attr "type" "neon_stp, store_<ldpstp_vel_sz>")]
> +)
> +
>  (define_insn "load_pair"
>[(set (match_operand:VQ 0 "register_operand" "=w")
>   (match_operand:VQ 1 "aarch64_mem_pair_operand" "Ump"))
> diff --git a/gcc/config/aarch64/constraints.md b/gcc/config/aarch64/constraints.md
> index 5b20abc27e5..6df1dbec2a8 100644
> --- a/gcc/config/aarch64/constraints.md
> +++ b/gcc/config/aarch64/constraints.md
> @@ -287,7 +287,7 @@
>  ;; Used for storing or loading pairs in an AdvSIMD register using an STP/LDP
>  ;; as a vector-concat.  The address mode uses the same constraints as if it
>  ;; were for a single value.
> -(define_memory_constraint "Umn"
> +(define_relaxed_memory_constraint "Umn"
>"@internal
>A memory address suitable for a load/store pair operation."
>(and (match_code "mem")
> diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
> index 6cbc97cc82c..980dacb8025 100644
> --- a/gcc/config/aarch64/iterators.md
> +++ b/gcc/config/aarch64/iterators.md
> @@ -1017,6 +1017,9 @@
>  ;; Likewise for load/store pair.
>  (define_mode_attr ldpstp_sz [(SI "8") (DI "16")])
>  
> +;; Size of element access for STP/LDP-generated vectors.
> +(define_mode_attr ldpstp_vel_sz [(V2SI "8") (V2SF "8") (V2DI "16") (V2DF "16")])
> +
>  ;; For inequal width int to float conversion
>  (define_mode_attr w1 [(HF "w") (SF "w") (DF "x")])
>  (define_mode_attr w2 [(HF "x") (SF "x") (DF "w")])
> diff --git a/gcc/testsuite/gcc.target/aarch64/stp_vec_dup_32_64-1.c b/gcc/testsuite/gcc.target/aarch64/stp_vec_dup_32_64-1.c
> new file mode 100644
> index 000..fc2c1ea39e0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/stp_vec_dup_32_64-1.c
> @@ -0,0 +1,57 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +typedef long long v2di __attribute__((vector_size (16)));
> +typedef int v2si __attribute__((vector_size (8)));
> +
> +#define TESTV2DI(lab, idx)   \
> +  void   \
> +  stpv2di_##lab (v2di *x, long long a)   \
> +  {  \
> +v2di tmp = {a, a};   \
> +x[idx] = tmp;\
> +  }
> +
> +
> +#define TESTV2SI(lab, idx)   \
> +  void   \
> +  stpv2si_##lab (v2si *x, int a) \
> +  {  \
> +v2si tmp = {a, a};   \
> +x[idx] = tmp;\
> +  }  \
> +
> +/* Core test, no imm assembler offset:  */
> +
> +TESTV2SI(0, 0)
> +TESTV2DI(0, 0)
> +/* { dg-final { 

Re: [aarch64] Use force_reg instead of copy_to_mode_reg

2023-04-21 Thread Prathamesh Kulkarni via Gcc-patches
On Fri, 21 Apr 2023 at 21:00, Richard Sandiford
 wrote:
>
> Prathamesh Kulkarni  writes:
> > Hi Richard,
> > Based on your suggestions in the other thread, the patch uses force_reg
> > to avoid creating pseudo if value is already in a register.
> > Bootstrap+test passes on aarch64-linux-gnu.
> > OK to commit ?
> >
> > Thanks,
> > Prathamesh
> >
> > [aarch64] Use force_reg instead of copy_to_mode_reg.
> >
> > Use force_reg instead of copy_to_mode_reg in aarch64_simd_dup_constant
> > and aarch64_expand_vector_init to avoid creating pseudo if original value
> > is already in a register.
> >
> > gcc/ChangeLog:
> >   * config/aarch64/aarch64.cc (aarch64_simd_dup_constant): Use
> >   force_reg instead of copy_to_mode_reg.
> >   (aarch64_expand_vector_init): Likewise.
>
> OK, thanks.
Thanks, committed in:
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=e306501ff556647dc31915a63ce95a5496f08f97

Thanks,
Prathamesh
>
> Richard
>
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index 0d7470c05a1..321580d7f6a 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -21968,7 +21968,7 @@ aarch64_simd_dup_constant (rtx vals)
> >/* We can load this constant by using DUP and a constant in a
> >   single ARM register.  This will be cheaper than a vector
> >   load.  */
> > -  x = copy_to_mode_reg (inner_mode, x);
> > +  x = force_reg (inner_mode, x);
> >return gen_vec_duplicate (mode, x);
> >  }
> >
> > @@ -22082,7 +22082,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
> >/* Splat a single non-constant element if we can.  */
> >if (all_same)
> >  {
> > -  rtx x = copy_to_mode_reg (inner_mode, v0);
> > +  rtx x = force_reg (inner_mode, v0);
> >aarch64_emit_move (target, gen_vec_duplicate (mode, x));
> >return;
> >  }
> > @@ -22190,12 +22190,12 @@ aarch64_expand_vector_init (rtx target, rtx vals)
> >vector register.  For big-endian we want that position to hold
> >the last element of VALS.  */
> > maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
> > -   rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> > +   rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> > aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
> >   }
> >else
> >   {
> > -   rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> > +   rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> > aarch64_emit_move (target, gen_vec_duplicate (mode, x));
> >   }
> >
> > @@ -22205,7 +22205,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
> > rtx x = XVECEXP (vals, 0, i);
> > if (matches[i][0] == maxelement)
> >   continue;
> > -   x = copy_to_mode_reg (inner_mode, x);
> > +   x = force_reg (inner_mode, x);
> > emit_insn (GEN_FCN (icode) (target, x, GEN_INT (i)));
> >   }
> >return;
> > @@ -22249,7 +22249,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
> >rtx x = XVECEXP (vals, 0, i);
> >if (CONST_INT_P (x) || CONST_DOUBLE_P (x))
> >   continue;
> > -  x = copy_to_mode_reg (inner_mode, x);
> > +  x = force_reg (inner_mode, x);
> >emit_insn (GEN_FCN (icode) (target, x, GEN_INT (i)));
> >  }
> >  }


Re: Fix loop-ch

2023-04-21 Thread Jan Hubicka via Gcc-patches
> Hi,
> Ondrej Kubanek implemented profiling of loop histograms which should be useful
> to improve e.g. the quality of loop peeling or vectorization.  However it turns
> out that most of the histograms are lost on the way from profiling to the loop
> peeling pass (about 90%).  One common case is the following latent bug in loop
> header copying which forgets to update the loop header pointer.
> 
> Curiously enough it does work to make a single latch and preheader edge by
> splitting basic blocks, but it works with the wrong edge.  As a consequence
> every loop whose header was copied is removed from the loop tree and inserted
> again, losing all metadata.
> 
> The patch correctly updates the loop structure and also adds verification
> that the loop tree is OK after all transforms, which fails without
> the patch.
> 
> Bootstrapped/regtested x86_64-linux, plan to insteall this as obvious.
> 
Hi,
sadly I managed to mix up the patch and its WIP version in the previous commit.
This patch adds the missing edge iterator and also fixes a side case
where the new loop header would have multiple latches.

Bootstrapping/regtesting on x86_64-linux is in progress.  It was tested
previously and passed, so I hope it will pass again; I will then commit it
to unbreak master.

gcc/ChangeLog:

* tree-ssa-loop-ch.cc (ch_base::copy_headers): Fix previous
commit.

diff --git a/gcc/tree-ssa-loop-ch.cc b/gcc/tree-ssa-loop-ch.cc
index 560df39893e..9487e7f3e55 100644
--- a/gcc/tree-ssa-loop-ch.cc
+++ b/gcc/tree-ssa-loop-ch.cc
@@ -484,7 +484,10 @@ ch_base::copy_headers (function *fun)
   /* Ensure that the header will have just the latch as a predecessor
 inside the loop.  */
   if (!single_pred_p (exit->dest))
-   exit = single_pred_edge (split_edge (exit));
+   {
+ header = split_edge (exit);
+ exit = single_pred_edge (header);
+   }
 
   entry = loop_preheader_edge (loop);
 
@@ -547,16 +550,17 @@ ch_base::copy_headers (function *fun)
   /* Find correct latch.  We only duplicate chain of conditionals so
 there should be precisely two edges to the new header.  One entry
 edge and one to latch.  */
+  edge_iterator ei;
+  edge e;
   FOR_EACH_EDGE (e, ei, loop->header->preds)
if (header != e->src)
  {
loop->latch = e->src;
break;
  }
-  /* Ensure that the latch and the preheader is simple (we know that they
-are not now, since there was the loop exit condition.  */
-  split_edge (loop_preheader_edge (loop));
-  split_edge (loop_latch_edge (loop));
+  /* Ensure that the latch is simple.  */
+  if (!single_succ_p (loop_latch_edge (loop)->src))
+   split_edge (loop_latch_edge (loop));
 
   if (dump_file && (dump_flags & TDF_DETAILS))
{


Re: [aarch64] Use force_reg instead of copy_to_mode_reg

2023-04-21 Thread Richard Sandiford via Gcc-patches
Prathamesh Kulkarni  writes:
> Hi Richard,
> Based on your suggestions in the other thread, the patch uses force_reg
> to avoid creating pseudo if value is already in a register.
> Bootstrap+test passes on aarch64-linux-gnu.
> OK to commit ?
>
> Thanks,
> Prathamesh
>
> [aarch64] Use force_reg instead of copy_to_mode_reg.
>
> Use force_reg instead of copy_to_mode_reg in aarch64_simd_dup_constant
> and aarch64_expand_vector_init to avoid creating pseudo if original value
> is already in a register.
>
> gcc/ChangeLog:
>   * config/aarch64/aarch64.cc (aarch64_simd_dup_constant): Use
>   force_reg instead of copy_to_mode_reg.
>   (aarch64_expand_vector_init): Likewise.

OK, thanks.

Richard

> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 0d7470c05a1..321580d7f6a 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -21968,7 +21968,7 @@ aarch64_simd_dup_constant (rtx vals)
>/* We can load this constant by using DUP and a constant in a
>   single ARM register.  This will be cheaper than a vector
>   load.  */
> -  x = copy_to_mode_reg (inner_mode, x);
> +  x = force_reg (inner_mode, x);
>return gen_vec_duplicate (mode, x);
>  }
>  
> @@ -22082,7 +22082,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>/* Splat a single non-constant element if we can.  */
>if (all_same)
>  {
> -  rtx x = copy_to_mode_reg (inner_mode, v0);
> +  rtx x = force_reg (inner_mode, v0);
>aarch64_emit_move (target, gen_vec_duplicate (mode, x));
>return;
>  }
> @@ -22190,12 +22190,12 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>vector register.  For big-endian we want that position to hold
>the last element of VALS.  */
> maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
> -   rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> +   rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
>   }
>else
>   {
> -   rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> +   rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> aarch64_emit_move (target, gen_vec_duplicate (mode, x));
>   }
>  
> @@ -22205,7 +22205,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
> rtx x = XVECEXP (vals, 0, i);
> if (matches[i][0] == maxelement)
>   continue;
> -   x = copy_to_mode_reg (inner_mode, x);
> +   x = force_reg (inner_mode, x);
> emit_insn (GEN_FCN (icode) (target, x, GEN_INT (i)));
>   }
>return;
> @@ -22249,7 +22249,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>rtx x = XVECEXP (vals, 0, i);
>if (CONST_INT_P (x) || CONST_DOUBLE_P (x))
>   continue;
> -  x = copy_to_mode_reg (inner_mode, x);
> +  x = force_reg (inner_mode, x);
>emit_insn (GEN_FCN (icode) (target, x, GEN_INT (i)));
>  }
>  }


[PATCH] doc: Update install.texi for GCC 13

2023-04-21 Thread Rainer Orth
install.texi needs some updates for GCC 13 and trunk:

* We used a mixture of Solaris 2 and Solaris references.  Since Solaris
  1/SunOS 4 is ancient history by now, consistently use Solaris
  everywhere.  Likewise, explicit references to Solaris 11 can go in
  many places since Solaris 11.3 and 11.4 are all GCC supports.

* Some caveats apply to both Solaris/SPARC and x86, like the difference
  between as and gas.

* Some specifics are obsolete, like the /usr/ccs/bin path whose contents
  were merged into /usr/bin in Solaris 11.0 already.  Likewise, /bin/sh
  has been ksh93 since Solaris 11.0, so there's no need to explicitly use
  /bin/ksh.

* There's little to no need for external sites to get additional
  packages.  OpenCSW, which is still mentioned, is mostly unmaintained
  these days and rather harmful than helpful.  I've kept it since there's
  a version of GNAT 5 available that might be useful for Ada bootstrap on
  Solaris 11.3; however I've not actually tried that.

* The section on assembler and linker to use was partially duplicated.
  Better keep the info in one place.

* GNAT is bundled in recent Solaris 11.4 updates, so it's better to use
  that one if possible rather than the ancient versions on OpenCSW.

Tested on i386-pc-solaris2.11 with make doc/gccinstall.{info,pdf} and
inspection of the latter.

Will commit to trunk soon.  Ok for the gcc-13 branch, too?

Thanks.
Rainer

-- 
-
Rainer Orth, Center for Biotechnology, Bielefeld University


2023-04-21  Rainer Orth  

gcc:
* doc/install.texi: Consistently use Solaris rather than Solaris 2.
Remove explicit Solaris 11 references.
Markup fixes.
(Options specification, --with-gnu-as): as and gas always differ
on Solaris.
Remove /usr/ccs/bin reference.
(Installing GCC: Binaries, Solaris (SPARC, Intel)): Warn about
OpenCSW.
(i?86-*-solaris2*): Merge assembler, linker recommendations into
*-*-solaris2* section.
(*-*-solaris2*): Update bundled GCC versions.
Remove /bin/sh warning.
Update assembler, linker recommendations.
Document GNAT bootstrap compiler.
(sparc-sun-solaris2*): Remove non-UltraSPARC reference.
(sparc64-*-solaris2*): Move content...
(sparcv9-*-solaris2*): ...here.
Add GDC for 64-bit bootstrap compilers.

# HG changeset patch
# Parent  20428970f3d1e321934c839526cfc0a10ecbfdc5
doc: Update install.texi for GCC 13

diff --git a/gcc/doc/install.texi b/gcc/doc/install.texi
--- a/gcc/doc/install.texi
+++ b/gcc/doc/install.texi
@@ -370,7 +370,7 @@ systems' @command{tar} programs will als
 
 Necessary when targeting Darwin, building @samp{libstdc++},
 and not using @option{--disable-symvers}.
-Necessary when targeting Solaris 2 with Solaris @command{ld} and not using
+Necessary when targeting Solaris with Solaris @command{ld} and not using
 @option{--disable-symvers}.
 
 Necessary when regenerating @file{Makefile} dependencies in libiberty.
@@ -1098,8 +1098,7 @@ whether you use the GNU assembler.  On a
 @itemize @bullet
 @item @samp{hppa1.0-@var{any}-@var{any}}
 @item @samp{hppa1.1-@var{any}-@var{any}}
-@item @samp{sparc-sun-solaris2.@var{any}}
-@item @samp{sparc64-@var{any}-solaris2.@var{any}}
+@item @samp{*-*-solaris2.11}
 @end itemize
 
 @item @anchor{with-as}--with-as=@var{pathname}
@@ -1114,13 +1113,12 @@ Unless GCC is being built with a cross c
 @var{exec-prefix} defaults to @var{prefix}, which
 defaults to @file{/usr/local} unless overridden by the
 @option{--prefix=@var{pathname}} switch described above.  @var{target}
-is the target system triple, such as @samp{sparc-sun-solaris2.7}, and
+is the target system triple, such as @samp{sparc-sun-solaris2.11}, and
 @var{version} denotes the GCC version, such as 3.0.
 
 @item
 If the target system is the same that you are building on, check
-operating system specific directories (e.g.@: @file{/usr/ccs/bin} on
-Solaris 2).
+operating system specific directories.
 
 @item
 Check in the @env{PATH} for a tool whose name is prefixed by the
@@ -3570,10 +3568,11 @@ HP-UX:
 @end itemize
 
 @item
-Solaris 2 (SPARC, Intel):
+Solaris (SPARC, Intel):
 @itemize
 @item
-@uref{https://www.opencsw.org/,,OpenCSW}
+@uref{https://www.opencsw.org/,,OpenCSW}.  However, the packages there
+are mostly outdated or actually harmful on Solaris 11.3 and 11.4.
 @end itemize
 
 @item
@@ -3798,7 +3797,7 @@ information have to.
 
 @itemize
 @item
-@uref{#elf,,all ELF targets} (SVR4, Solaris 2, etc.)
+@uref{#elf,,all ELF targets} (SVR4, Solaris, etc.)
 @end itemize
 @end ifhtml
 
@@ -4262,24 +4261,6 @@ with GCC 4.7, there is also a 64-bit @sa
 @samp{x86_64-*-solaris2*} configuration that corresponds to
 @samp{sparcv9-sun-solaris2*}.
 
-It is recommended that you configure GCC to use the GNU assembler.  The
-versions included in Solaris 11.3, from GNU binutils 2.23.1 or
-newer (available as 

[PATCH v4 3/4] ree: Main functionality to improve ree pass for rs6000 target.

2023-04-21 Thread Ajit Agarwal via Gcc-patches
Hello All:

This patch is the new version of patch 3 to improve the ree pass for the rs6000 target.
Bootstrapped and regtested on powerpc64-linux-gnu.

Thanks & Regards
Ajit

ree: Improve ree pass for rs6000 target

For the rs6000 target we see redundant zero and sign
extensions, and the ree pass is improved to eliminate
such redundant zero and sign extensions.  Adds support
for zero_extend, sign_extend and AND.
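
A hedged illustration of the kind of redundancy being targeted (a
hypothetical example, not one of the new testcases):

unsigned long
f (unsigned int *p)
{
  /* On Power, the lwz load already zero-extends the 32-bit value to
     64 bits, so a following rldicl that re-implements the zero_extend
     is redundant and can be removed by ree.  */
  return *p;
}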

2023-04-21  Ajit Kumar Agarwal  

gcc/ChangeLog:

* ree.cc (eliminate_across_bbs_p): Add checks to enable extension
elimination across and within basic blocks.
(def_arith_p): New function to check definition has arithmetic
operation.
(combine_set_extension): Modification to incorporate AND
and current zero_extend and sign_extend instruction.
(merge_def_and_ext): Add calls to eliminate_across_bbs_p and
zero_extend sign_extend and AND instruction.
(rtx_is_zext_p): New function.
(feasible_cfg): New function.
* rtl.h (reg_used_set_between_p): Add prototype.
* rtlanal.cc (reg_used_set_between_p): New function.

gcc/testsuite/ChangeLog:

* g++.target/powerpc/zext-elim.C: New testcase.
* g++.target/powerpc/zext-elim-1.C: New testcase.
* g++.target/powerpc/zext-elim-2.C: New testcase.
* g++.target/powerpc/sext-elim.C: New testcase.
---
 gcc/ree.cc| 471 --
 gcc/rtl.h |   1 +
 gcc/rtlanal.cc|  15 +
 gcc/testsuite/g++.target/powerpc/sext-elim.C  |  18 +
 .../g++.target/powerpc/zext-elim-1.C  |  19 +
 .../g++.target/powerpc/zext-elim-2.C  |  11 +
 gcc/testsuite/g++.target/powerpc/zext-elim.C  |  30 ++
 7 files changed, 519 insertions(+), 46 deletions(-)
 create mode 100644 gcc/testsuite/g++.target/powerpc/sext-elim.C
 create mode 100644 gcc/testsuite/g++.target/powerpc/zext-elim-1.C
 create mode 100644 gcc/testsuite/g++.target/powerpc/zext-elim-2.C
 create mode 100644 gcc/testsuite/g++.target/powerpc/zext-elim.C

diff --git a/gcc/ree.cc b/gcc/ree.cc
index 413aec7c8eb..96fda1ac658 100644
--- a/gcc/ree.cc
+++ b/gcc/ree.cc
@@ -253,6 +253,61 @@ struct ext_cand
 
 static int max_insn_uid;
 
+/* Return TRUE if OP can be considered a zero extension from one or
+   more sub-word modes to larger modes up to a full word.
+
+   For example (and:DI (reg) (const_int X))
+
+   Depending on the value of X could be considered a zero extension
+   from QI, HI and SI to larger modes up to DImode.  */
+
+static bool
+rtx_is_zext_p (rtx insn)
+{
+  if (GET_CODE (insn) == AND)
+    {
+      rtx set = XEXP (insn, 0);
+      if (REG_P (set))
+        {
+          if (XEXP (insn, 1) == const1_rtx)
+            return true;
+        }
+      else
+        return false;
+    }
+
+  return false;
+}
+
+/* Return TRUE if OP can be considered a zero extension from one or
+   more sub-word modes to larger modes up to a full word.
+
+   For example (and:DI (reg) (const_int X))
+
+   Depending on the value of X could be considered a zero extension
+   from QI, HI and SI to larger modes up to DImode.  */
+
+static bool
+rtx_is_zext_p (rtx_insn *insn)
+{
+  rtx body = single_set (insn);
+
+  if (GET_CODE (body) == SET && GET_CODE (SET_SRC (body)) == AND)
+    {
+      rtx set = XEXP (SET_SRC (body), 0);
+
+      if (REG_P (set) && GET_MODE (SET_DEST (body)) == GET_MODE (set))
+        {
+          if (GET_MODE_UNIT_SIZE (GET_MODE (SET_DEST (body)))
+              >= GET_MODE_UNIT_SIZE (GET_MODE (set)))
+            return true;
+        }
+      else
+        return false;
+    }
+
+  return false;
+}
+
 /* Update or remove REG_EQUAL or REG_EQUIV notes for INSN.  */
 
 static bool
@@ -319,7 +374,7 @@ combine_set_extension (ext_cand *cand, rtx_insn *curr_insn, 
rtx *orig_set)
 {
   rtx orig_src = SET_SRC (*orig_set);
   machine_mode orig_mode = GET_MODE (SET_DEST (*orig_set));
-  rtx new_set;
+  rtx new_set = NULL_RTX;
   rtx cand_pat = single_set (cand->insn);
 
   /* If the extension's source/destination registers are not the same
@@ -359,27 +414,41 @@ combine_set_extension (ext_cand *cand, rtx_insn 
*curr_insn, rtx *orig_set)
   else if (GET_CODE (orig_src) == cand->code)
 {
   /* Here is a sequence of two extensions.  Try to merge them.  */
-  rtx temp_extension
-   = gen_rtx_fmt_e (cand->code, cand->mode, XEXP (orig_src, 0));
+  rtx temp_extension = NULL_RTX;
+  if (GET_CODE (SET_SRC (cand_pat)) == AND)
+   temp_extension
+   = gen_rtx_AND (cand->mode, XEXP (orig_src, 0), XEXP (orig_src, 1));
+  else
+   temp_extension
+= gen_rtx_fmt_e (cand->code, cand->mode, XEXP (orig_src, 0));
   rtx simplified_temp_extension = simplify_rtx (temp_extension);
   if (simplified_temp_extension)
 temp_extension = simplified_temp_extension;
+
   new_set = gen_rtx_SET (new_reg, temp_extension);
 }
   else if (GET_CODE 

[aarch64] Use force_reg instead of copy_to_mode_reg

2023-04-21 Thread Prathamesh Kulkarni via Gcc-patches
Hi Richard,
Based on your suggestions in the other thread, the patch uses force_reg
to avoid creating a pseudo if the value is already in a register.
Bootstrap+test passes on aarch64-linux-gnu.
OK to commit?

Thanks,
Prathamesh
[aarch64] Use force_reg instead of copy_to_mode_reg.

Use force_reg instead of copy_to_mode_reg in aarch64_simd_dup_constant
and aarch64_expand_vector_init to avoid creating a pseudo if the original
value is already in a register.
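
For context, a compressed sketch of the difference between the two expander
helpers (both are real GCC internals; the surrounding lines are
illustrative only):

  rtx x = ...;                         /* may already be a REG */
  rtx a = copy_to_mode_reg (mode, x);  /* always copies into a fresh pseudo */
  rtx b = force_reg (mode, x);         /* returns x unchanged when REG_P (x) */

The redundant pseudo-to-pseudo moves produced by the first form are what
skewed the cost comparison in the interleave+zip1 thread.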

gcc/ChangeLog:
* config/aarch64/aarch64.cc (aarch64_simd_dup_constant): Use
force_reg instead of copy_to_mode_reg.
(aarch64_expand_vector_init): Likewise.

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 0d7470c05a1..321580d7f6a 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -21968,7 +21968,7 @@ aarch64_simd_dup_constant (rtx vals)
   /* We can load this constant by using DUP and a constant in a
  single ARM register.  This will be cheaper than a vector
  load.  */
-  x = copy_to_mode_reg (inner_mode, x);
+  x = force_reg (inner_mode, x);
   return gen_vec_duplicate (mode, x);
 }
 
@@ -22082,7 +22082,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
   /* Splat a single non-constant element if we can.  */
   if (all_same)
 {
-  rtx x = copy_to_mode_reg (inner_mode, v0);
+  rtx x = force_reg (inner_mode, v0);
   aarch64_emit_move (target, gen_vec_duplicate (mode, x));
   return;
 }
@@ -22190,12 +22190,12 @@ aarch64_expand_vector_init (rtx target, rtx vals)
 vector register.  For big-endian we want that position to hold
 the last element of VALS.  */
  maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
- rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
+ rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
  aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
}
   else
{
- rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
+ rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
  aarch64_emit_move (target, gen_vec_duplicate (mode, x));
}
 
@@ -22205,7 +22205,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
  rtx x = XVECEXP (vals, 0, i);
  if (matches[i][0] == maxelement)
continue;
- x = copy_to_mode_reg (inner_mode, x);
+ x = force_reg (inner_mode, x);
  emit_insn (GEN_FCN (icode) (target, x, GEN_INT (i)));
}
   return;
@@ -22249,7 +22249,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
   rtx x = XVECEXP (vals, 0, i);
   if (CONST_INT_P (x) || CONST_DOUBLE_P (x))
continue;
-  x = copy_to_mode_reg (inner_mode, x);
+  x = force_reg (inner_mode, x);
   emit_insn (GEN_FCN (icode) (target, x, GEN_INT (i)));
 }
 }


Re: [aarch64] Use dup and zip1 for interleaving elements in initializing vector

2023-04-21 Thread Prathamesh Kulkarni via Gcc-patches
On Fri, 21 Apr 2023 at 14:47, Richard Sandiford
 wrote:
>
> Prathamesh Kulkarni  writes:
> > Hi,
> > I tested the interleave+zip1 for vector init patch and it segfaulted
> > during bootstrap while trying to build
> > libgfortran/generated/matmul_i2.c.
> > Rebuilding with --enable-checking=rtl showed out of bounds access in
> > aarch64_unzip_vector_init in following hunk:
> >
> > +  rtvec vec = rtvec_alloc (n / 2);
> > +  for (int i = 0; i < n; i++)
> > +RTVEC_ELT (vec, i) = (even_p) ? XVECEXP (vals, 0, 2 * i)
> > + : XVECEXP (vals, 0, 2 * i + 1);
> >
> > which is incorrect since it allocates n/2 but iterates and stores upto n.
> > The attached patch fixes the issue, which passed bootstrap, however
> > resulted in following fallout during testsuite run:
> >
> > 1] sve/acle/general/dupq_[1-4].c tests fail.
> > For the following test:
> > int32x4_t f(int32_t x)
> > {
> >   return (int32x4_t) { x, 1, 2, 3 };
> > }
> >
> > Code-gen without patch:
> > f:
> > adrp x1, .LC0
> > ldr q0, [x1, #:lo12:.LC0]
> > ins v0.s[0], w0
> > ret
> >
> > Code-gen with patch:
> > f:
> > movi v0.2s, 0x2
> > adrp x1, .LC0
> > ldr d1, [x1, #:lo12:.LC0]
> > ins v0.s[0], w0
> > zip1 v0.4s, v0.4s, v1.4s
> > ret
> >
> > It shows, fallback_seq_cost = 20, seq_total_cost = 16
> > where seq_total_cost determines the cost for interleave+zip1 sequence
> > and fallback_seq_cost is the cost for fallback sequence.
> > Although it shows a lower cost, I am not sure if the interleave+zip1
> > sequence is better in this case?
>
> Debugging the patch, it looks like this is because the fallback sequence
> contains a redundant pseudo-to-pseudo move, which is costed as 1
> instruction (4 units).  The RTL equivalent of the:
>
>  movi v0.2s, 0x2
>  ins v0.s[0], w0
>
> has a similar redundant move, but the cost of that move is subsumed by
> the cost of the other arm (the load from LC0), which is costed as 3
> instructions (12 units).  So we have 12 + 4 for the parallel version
> (correct) but 12 + 4 + 4 for the serial version (one instruction too
> many).
>
> The reason we have redundant moves is that the expansion code uses
> copy_to_mode_reg to force a value into a register.  This creates a
> new pseudo even if the original value was already a register.
> Using force_reg removes the moves and makes the test pass.
>
> So I think the first step is to use force_reg instead of
> copy_to_mode_reg in aarch64_simd_dup_constant and
> aarch64_expand_vector_init (as a preparatory patch).
Thanks for the clarification!
>
> > 2] sve/acle/general/dupq_[5-6].c tests fail:
> > int32x4_t f(int32_t x0, int32_t x1, int32_t x2, int32_t x3)
> > {
> >   return (int32x4_t) { x0, x1, x2, x3 };
> > }
> >
> > code-gen without patch:
> > f:
> > fmov s0, w0
> > ins v0.s[1], w1
> > ins v0.s[2], w2
> > ins v0.s[3], w3
> > ret
> >
> > code-gen with patch:
> > f:
> > fmov s0, w0
> > fmov s1, w1
> > ins v0.s[1], w2
> > ins v1.s[1], w3
> > zip1 v0.4s, v0.4s, v1.4s
> > ret
> >
> > It shows fallback_seq_cost = 28, seq_total_cost = 16
>
> The zip version still wins after the fix above, but by a lesser amount.
> It seems like a borderline case.
>
> >
> > 3] aarch64/ldp_stp_16.c's cons2_8_float test fails.
> > Test case:
> > void cons2_8_float(float *x, float val0, float val1)
> > {
> > #pragma GCC unroll(8)
> >   for (int i = 0; i < 8 * 2; i += 2) {
> > x[i + 0] = val0;
> > x[i + 1] = val1;
> >   }
> > }
> >
> > which is lowered to:
> > void cons2_8_float (float * x, float val0, float val1)
> > {
> >   vector(4) float _86;
> >
> >[local count: 119292720]:
> >   _86 = {val0_11(D), val1_13(D), val0_11(D), val1_13(D)};
> >   MEM <vector(4) float> [(float *)x_10(D)] = _86;
> >   MEM <vector(4) float> [(float *)x_10(D) + 16B] = _86;
> >   MEM <vector(4) float> [(float *)x_10(D) + 32B] = _86;
> >   MEM <vector(4) float> [(float *)x_10(D) + 48B] = _86;
> >   return;
> > }
> >
> > code-gen without patch:
> > cons2_8_float:
> > dup v0.4s, v0.s[0]
> > ins v0.s[1], v1.s[0]
> > ins v0.s[3], v1.s[0]
> > stp q0, q0, [x0]
> > stp q0, q0, [x0, 32]
> > ret
> >
> > code-gen with patch:
> > cons2_8_float:
> > dup v1.2s, v1.s[0]
> > dup v0.2s, v0.s[0]
> > zip1 v0.4s, v0.4s, v1.4s
> > stp q0, q0, [x0]
> > stp q0, q0, [x0, 32]
> > ret
> >
> > It shows fallback_seq_cost = 28, seq_total_cost = 16
> >
> > I think the test fails because it doesn't match:
> > **  dup v([0-9]+)\.4s, .*
> >
> > Shall it be OK to amend the test assuming code-gen with patch is better ?
>
> Yeah, the new code seems like an improvement.
>
> > 4] aarch64/pr109072_1.c s32x4_3 test fails:
> > For the following test:
> > int32x4_t s32x4_3 (int32_t x, int32_t y)
> > {
> >   int32_t arr[] = 

[committed, gcc-12] libstdc++: Optimize std::try_facet and std::use_facet [PR103755]

2023-04-21 Thread Jonathan Wakely via Gcc-patches
Tested powerpc64le-linux. Pushed to gcc-12.

-- >8 --

The std::try_facet and std::use_facet functions were optimized in
r13-3888-gb3ac43a3c05744 to avoid redundant checking for all facets that
are required to always be present in every locale.

This performs a simpler version of the optimization that only applies to
std::ctype<char>, std::num_get<char>, std::num_put<char>, and the
wchar_t specializations of those facets. Those are the facets that are
cached by std::basic_ios, which means they're used on construction for
every iostream object. This smaller change is suitable for the gcc-12
branch, and mitigates the performance loss for powerpc64le-linux on the
gcc-12 branch caused by r12-9454-g24cf9f4c6f45f7 for PR 103387. It also
greatly improves the performance of constructing iostreams objects, for
all targets.
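
As an illustration (example code, not part of the patch), these are the
guaranteed-present facet lookups on the iostream construction path that now
skip the array bounds and dynamic type checks:

  #include <locale>

  void example ()
  {
    std::locale loc;
    // Required facets: use_facet for these no longer checks the array
    // index or performs a checked cast.
    const auto& ct = std::use_facet<std::ctype<char>>(loc);
    const auto& np = std::use_facet<std::num_put<char>>(loc);
    (void) ct; (void) np;
  }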

libstdc++-v3/ChangeLog:

PR libstdc++/103755
* include/bits/locale_classes.tcc (try_facet, use_facet): Do not
check array index or dynamic type when accessing required
specializations of std::ctype, std::num_get, or std::num_put.
* testsuite/22_locale/ctype/is/string/89728_neg.cc: Adjust
expected errors.
---
 libstdc++-v3/include/bits/locale_classes.tcc  | 23 +++
 .../22_locale/ctype/is/string/89728_neg.cc|  1 +
 2 files changed, 24 insertions(+)

diff --git a/libstdc++-v3/include/bits/locale_classes.tcc 
b/libstdc++-v3/include/bits/locale_classes.tcc
index 64cd7534dc6..e6ce07ae8b7 100644
--- a/libstdc++-v3/include/bits/locale_classes.tcc
+++ b/libstdc++-v3/include/bits/locale_classes.tcc
@@ -103,6 +103,17 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 bool
 has_facet(const locale& __loc) throw()
 {
+  if _GLIBCXX17_CONSTEXPR (__is_same(_Facet, ctype<char>)
+                           || __is_same(_Facet, num_get<char>)
+                           || __is_same(_Facet, num_put<char>))
+       return true;
+#ifdef _GLIBCXX_USE_WCHAR_T
+  else if _GLIBCXX17_CONSTEXPR (__is_same(_Facet, ctype<wchar_t>)
+                                || __is_same(_Facet, num_get<wchar_t>)
+                                || __is_same(_Facet, num_put<wchar_t>))
+   return true;
+#endif
+
   const size_t __i = _Facet::id._M_id();
   const locale::facet** __facets = __loc._M_impl->_M_facets;
   return (__i < __loc._M_impl->_M_facets_size
@@ -133,6 +144,18 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 {
   const size_t __i = _Facet::id._M_id();
   const locale::facet** __facets = __loc._M_impl->_M_facets;
+
+  if _GLIBCXX17_CONSTEXPR (__is_same(_Facet, ctype<char>)
+                           || __is_same(_Facet, num_get<char>)
+                           || __is_same(_Facet, num_put<char>))
+       return static_cast<const _Facet&>(*__facets[__i]);
+#ifdef _GLIBCXX_USE_WCHAR_T
+  else if _GLIBCXX17_CONSTEXPR (__is_same(_Facet, ctype<wchar_t>)
+                                || __is_same(_Facet, num_get<wchar_t>)
+                                || __is_same(_Facet, num_put<wchar_t>))
+       return static_cast<const _Facet&>(*__facets[__i]);
+#endif
+
   if (__i >= __loc._M_impl->_M_facets_size || !__facets[__i])
 __throw_bad_cast();
 #if __cpp_rtti
diff --git a/libstdc++-v3/testsuite/22_locale/ctype/is/string/89728_neg.cc 
b/libstdc++-v3/testsuite/22_locale/ctype/is/string/89728_neg.cc
index 77bb1a64f45..baed67d64f4 100644
--- a/libstdc++-v3/testsuite/22_locale/ctype/is/string/89728_neg.cc
+++ b/libstdc++-v3/testsuite/22_locale/ctype/is/string/89728_neg.cc
@@ -18,6 +18,7 @@
 // .
 
 // { dg-error "complete" "" { target *-*-* } 0 }
+// { dg-error "invalid .static_cast." "" { target c++14_down } 0 }
 
 #include 
 
-- 
2.40.0



[PATCH] i386: Remove REG_OK_FOR_INDEX/REG_OK_FOR_BASE and their derivatives

2023-04-21 Thread Uros Bizjak via Gcc-patches
x86 was converted to TARGET_LEGITIMATE_ADDRESS_P long ago.  Remove
remnants of the conversion.  Also, cleanup the remaining macros a bit
by introducing INDEX_REGNO_P macro.

No functional change.

gcc/ChangeLog:

2023-04-21  Uroš Bizjak  

* config/i386/i386.h (REG_OK_FOR_INDEX_P, REG_OK_FOR_BASE_P): Remove.
(REG_OK_FOR_INDEX_NONSTRICT_P,  REG_OK_FOR_BASE_NONSTRICT_P): Ditto.
(REG_OK_FOR_INDEX_STRICT_P, REG_OK_FOR_BASE_STRICT_P): Ditto.

(FIRST_INDEX_REG, LAST_INDEX_REG): New defines.
(LEGACY_INDEX_REG_P, LEGACY_INDEX_REGNO_P): New macros.
(INDEX_REG_P, INDEX_REGNO_P): Ditto.

(REGNO_OK_FOR_INDEX_P): Use INDEX_REGNO_P predicates.

(REGNO_OK_FOR_INDEX_NONSTRICT_P): New macro.
(REGNO_OK_FOR_BASE_NONSTRICT_P): Ditto.

* config/i386/predicates.md (index_register_operand):
Use REGNO_OK_FOR_INDEX_P and REGNO_OK_FOR_INDEX_NONSTRICT_P macros.

* config/i386/i386.cc (ix86_legitimate_address_p): Use
REGNO_OK_FOR_BASE_P, REGNO_OK_FOR_BASE_NONSTRICT_P,
REGNO_OK_FOR_INDEX_P and REGNO_OK_FOR_INDEX_NONSTRICT_P macros.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Pushed to master.

Uros.
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index fbd33a6bfd1..a3db55642e3 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -11035,8 +11035,9 @@ ix86_legitimate_address_p (machine_mode, rtx addr, bool 
strict)
   if (reg == NULL_RTX)
return false;
 
-  if ((strict && ! REG_OK_FOR_BASE_STRICT_P (reg))
- || (! strict && ! REG_OK_FOR_BASE_NONSTRICT_P (reg)))
+  unsigned int regno = REGNO (reg);
+  if ((strict && !REGNO_OK_FOR_BASE_P (regno))
+ || (!strict && !REGNO_OK_FOR_BASE_NONSTRICT_P (regno)))
/* Base is not valid.  */
return false;
 }
@@ -11049,8 +11050,9 @@ ix86_legitimate_address_p (machine_mode, rtx addr, bool 
strict)
   if (reg == NULL_RTX)
return false;
 
-  if ((strict && ! REG_OK_FOR_INDEX_STRICT_P (reg))
- || (! strict && ! REG_OK_FOR_INDEX_NONSTRICT_P (reg)))
+  unsigned int regno = REGNO (reg);
+  if ((strict && !REGNO_OK_FOR_INDEX_P (regno))
+ || (!strict && !REGNO_OK_FOR_INDEX_NONSTRICT_P (regno)))
/* Index is not valid.  */
return false;
 }
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 1da6dce8e0b..c7439f89bdf 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -1166,6 +1166,9 @@ extern const char *host_detect_local_cpu (int argc, const 
char **argv);
 #define FIRST_INT_REG AX_REG
 #define LAST_INT_REG  SP_REG
 
+#define FIRST_INDEX_REG AX_REG
+#define LAST_INDEX_REG  BP_REG
+
 #define FIRST_QI_REG AX_REG
 #define LAST_QI_REG  BX_REG
 
@@ -1404,7 +1407,11 @@ enum reg_class
 #define QI_REGNO_P(N) IN_RANGE ((N), FIRST_QI_REG, LAST_QI_REG)
 
 #define LEGACY_INT_REG_P(X) (REG_P (X) && LEGACY_INT_REGNO_P (REGNO (X)))
-#define LEGACY_INT_REGNO_P(N) (IN_RANGE ((N), FIRST_INT_REG, LAST_INT_REG))
+#define LEGACY_INT_REGNO_P(N) IN_RANGE ((N), FIRST_INT_REG, LAST_INT_REG)
+
+#define LEGACY_INDEX_REG_P(X) (REG_P (X) && LEGACY_INDEX_REGNO_P (REGNO (X)))
+#define LEGACY_INDEX_REGNO_P(N) \
+  IN_RANGE ((N), FIRST_INDEX_REG, LAST_INDEX_REG)
 
 #define REX_INT_REG_P(X) (REG_P (X) && REX_INT_REGNO_P (REGNO (X)))
 #define REX_INT_REGNO_P(N) \
@@ -1414,6 +1421,10 @@ enum reg_class
 #define GENERAL_REGNO_P(N) \
   (LEGACY_INT_REGNO_P (N) || REX_INT_REGNO_P (N))
 
+#define INDEX_REG_P(X) (REG_P (X) && INDEX_REGNO_P (REGNO (X)))
+#define INDEX_REGNO_P(N) \
+  (LEGACY_INDEX_REGNO_P (N) || REX_INT_REGNO_P (N))
+
 #define ANY_QI_REG_P(X) (REG_P (X) && ANY_QI_REGNO_P (REGNO (X)))
 #define ANY_QI_REGNO_P(N) \
   (TARGET_64BIT ? GENERAL_REGNO_P (N) : QI_REGNO_P (N))
@@ -1678,56 +1689,26 @@ typedef struct ix86_args {
has been allocated, which happens in reginfo.cc during register
allocation.  */
 
-#define REGNO_OK_FOR_INDEX_P(REGNO)\
-  ((REGNO) < STACK_POINTER_REGNUM  \
-   || REX_INT_REGNO_P (REGNO)  \
-   || (unsigned) reg_renumber[(REGNO)] < STACK_POINTER_REGNUM  \
-   || REX_INT_REGNO_P ((unsigned) reg_renumber[(REGNO)]))
+#define REGNO_OK_FOR_INDEX_P(REGNO)\
+  (INDEX_REGNO_P (REGNO)   \
+   || INDEX_REGNO_P (reg_renumber[(REGNO)]))
 
-#define REGNO_OK_FOR_BASE_P(REGNO) \
+#define REGNO_OK_FOR_BASE_P(REGNO) \
   (GENERAL_REGNO_P (REGNO) \
|| (REGNO) == ARG_POINTER_REGNUM\
|| (REGNO) == FRAME_POINTER_REGNUM  \
-   || GENERAL_REGNO_P ((unsigned) reg_renumber[(REGNO)]))
-
-/* The macros REG_OK_FOR..._P assume that the arg is a REG rtx
-   and check its validity for a certain class.
-   We 

[PATCH 1/2] [i386] Support type _Float16/__bf16 independent of SSE2.

2023-04-21 Thread liuhongt via Gcc-patches
> > +  if (!TARGET_SSE2)
> > +{
> > +  if (c_dialect_cxx ()
> > +   && cxx_dialect > cxx20)
>
> Formatting, both conditions are short, so just put them on one line.
Changed.

> But for the C++23 macros, more importantly I think we really should
> also in ix86_target_macros_internal add
>   if (c_dialect_cxx ()
>   && cxx_dialect > cxx20
>   && (isa_flag & OPTION_MASK_ISA_SSE2))
> {
>   def_or_undef (parse_in, "__STDCPP_FLOAT16_T__");
>   def_or_undef (parse_in, "__STDCPP_BFLOAT16_T__");
> }
> plus associated libstdc++ changes.  It can be done incrementally though.
Added in PATCH 2/2

> > +      if (flag_building_libgcc)
> > +     {
> > +       /* libbid uses __LIBGCC_HAS_HF_MODE__ and __LIBGCC_HAS_BF_MODE__
> > +          to check backend support of _Float16 and __bf16 type.  */
>
> That is actually the case only for HFmode, but not for BFmode right now.
> So, we need further work.  One is to add the BFmode support in there,
> and another one is make sure the _Float16 <-> _Decimal* and __bf16 <->
> _Decimal* conversions are compiled in also if not -msse2 by default.
> One way to do that is wrap the HF and BF mode related functions on x86
> #ifndef __SSE2__ into the pragmas like intrin headers use (but then
> perhaps we don't need to undef this stuff here), another is not provide
> the hf/bf support in that case from the TUs where they are provided now,
> but from a different one which would be compiled with -msse2.
Add CFLAGS-_hf_to_sd.c += -msse2, similar for other files in libbid, just like
we did before for HFtype softfp. Then no need to undef libgcc macros.

> >/* We allowed the user to turn off SSE for kernel mode.  Don't crash if
> >   some less clueful developer tries to use floating-point anyway.  */
> > -  if (needed_sseregs && !TARGET_SSE)
> > +  if (needed_sseregs
> > +  && (!TARGET_SSE
> > +   || (VALID_SSE2_TYPE_MODE (mode)
> > +   && !TARGET_SSE2)))
>
> Formatting, no need to split this up that much.
>   if (needed_sseregs
>   && (!TARGET_SSE
>   || (VALID_SSE2_TYPE_MODE (mode) && !TARGET_SSE2)))
> or even better
>   if (needed_sseregs
>   && (!TARGET_SSE || (VALID_SSE2_TYPE_MODE (mode) && !TARGET_SSE2)))
> will do it.
Changed.

> Instead of this, just use
>   if (!float16_type_node)
> {
>   float16_type_node = ix86_float16_type_node;
>   callback (float16_type_node);
>   float16_type_node = NULL_TREE;
> }
>   if (!bfloat16_type_node)
> {
>   bfloat16_type_node = ix86_bf16_type_node;
>   callback (bfloat16_type_node);
>   bfloat16_type_node = NULL_TREE;
> }
Changed.


> > +static const char *
> > +ix86_invalid_conversion (const_tree fromtype, const_tree totype)
> > +{
> > +  if (element_mode (fromtype) != element_mode (totype))
> > +    {
> > +      /* Do not allow conversions to/from BFmode/HFmode scalar types
> > +      when TARGET_SSE2 is not available.  */
> > +      if ((TYPE_MODE (fromtype) == BFmode
> > +        || TYPE_MODE (fromtype) == HFmode)
> > +       && !TARGET_SSE2)
>
> First of all, not really sure if this should be purely about scalar
> modes, not also complex and vector modes involving those inner modes.
> Because complex or vector modes with BF/HF elements will be without
> TARGET_SSE2 for sure lowered into scalar code and that can't be handled
> either.
> So if (!TARGET_SSE2 && GET_MODE_INNER (TYPE_MODE (fromtype)) == BFmode)
> or even better
> if (!TARGET_SSE2 && element_mode (fromtype) == BFmode)
> ?
> Or even better remember the 2 modes above into machine_mode temporaries
> and just use those in the != comparison and for the checks?
>
> Also, I think it is weird to tell user %<__bf16%> or %<_Float16%> when
> we know which one it is.  Just return separate messages?
Changed.

> > +  /* Reject all single-operand operations on BFmode/HFmode except for &
> > +     when TARGET_SSE2 is not available.  */
> > +  if ((element_mode (type) == BFmode || element_mode (type) == HFmode)
> > +      && !TARGET_SSE2 && op != ADDR_EXPR)
> > +    return N_("operation not permitted on type %<__bf16%> "
> > +           "or %<_Float16%> without option %<-msse2%>");
>
> Similarly.  Also, check !TARGET_SSE2 first as inexpensive one.
Changed.


Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Successfully cross-build i686-linux-gnu.
Ok for trunk?

Enable _Float16 and __bf16 all the time but issue errors when the
types are used in conversion, unary operation, binary operation,
parameter passing or value return when TARGET_SSE2 is not available.

Also undef macros which are used by libgcc/libstdc++ to check the
backend support of the _Float16/__bf16 types when TARGET_SSE2 is not
available.
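
As an illustration (hypothetical user code, not from the patch), with
-mno-sse2 the types can now be named, but uses are diagnosed:

  /* Bare declaration: accepted.  */
  __bf16 b;

  /* Parameter passing and arithmetic: rejected without -msse2.  */
  _Float16 add (_Float16 x, _Float16 y)
  {
    return x + y;
  }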

gcc/ChangeLog:

PR target/109504
* config/i386/i386-builtins.cc
(ix86_register_float16_builtin_type): Remove TARGET_SSE2.
(ix86_register_bf16_builtin_type): Ditto.
* config/i386/i386-c.cc 

[PATCH 2/2] [i386] def_or_undef __STDCPP_FLOAT16_T__ and __STDCPP_BFLOAT16_T__ for target attribute/pragmas.

2023-04-21 Thread liuhongt via Gcc-patches
> But for the C++23 macros, more importantly I think we really should
> also in ix86_target_macros_internal add
>   if (c_dialect_cxx ()
>   && cxx_dialect > cxx20
>   && (isa_flag & OPTION_MASK_ISA_SSE2))
> {
>   def_or_undef (parse_in, "__STDCPP_FLOAT16_T__");
>   def_or_undef (parse_in, "__STDCPP_BFLOAT16_T__");
> }
> plus associated libstdc++ changes.  It can be done incrementally though.
Changed except for one place in libsupc++/compare; it's inside a function
where no pragma can be added. Not sure if this inconsistency will cause any
issue.

#ifdef __STDCPP_BFLOAT16_T__
  if constexpr (__is_same(_Tp, decltype(0.0bf16)))
return _Bfloat16;
#endif

Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Successfully cross-build i686-linux-gnu.
Ok for trunk?

def_or_undef target macros based on the currently active ISA in pragmas,
so that __STDCPP_FLOAT16_T__ and __STDCPP_BFLOAT16_T__ are also handled
for C++, and change libstdc++ so that for x86 it wraps the
std::float16_t/std::bfloat16_t support in SSE2 target pragmas, similarly
to the x86 intrin headers.
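
As a user-visible illustration (assumes C++23 mode with SSE2 active; not
code from the patch), the macro can then be tested like any other feature
macro:

  #ifdef __STDCPP_FLOAT16_T__
  #include <stdfloat>
  std::float16_t half = 1.0f16;   // available once SSE2 is enabled
  #endif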

gcc/ChangeLog:

PR target/109504
* config/i386/i386-c.cc (ix86_target_macros_internal):
def_or_undef __STDCPP_FLOAT16_T__ and __STDCPP_BFLOAT16_T__.

libstdc++-v3/ChangeLog:

* include/bits/c++config: Add #pragma GCC target("sse2") for
_Float16 and bfloat16_t when __SSE2__ is not available.
* include/bits/cpp_type_traits.h: Ditto.
* include/bits/std_abs.h: Ditto.
* include/c_global/cmath: Ditto.
* include/ext/type_traits.h: Ditto.
* include/std/atomic: Ditto.
* include/std/charconv: Ditto.
* include/std/complex: Ditto.
* include/std/istream: Ditto.
* include/std/limits: Ditto.
* include/std/numbers: Ditto.
* include/std/ostream: Ditto.
* include/std/stdfloat: Ditto.
* include/std/type_traits: Ditto.
---
 gcc/config/i386/i386-c.cc   |   9 +-
 libstdc++-v3/include/bits/c++config |  11 +
 libstdc++-v3/include/bits/cpp_type_traits.h |  27 +-
 libstdc++-v3/include/bits/std_abs.h |  23 +-
 libstdc++-v3/include/c_global/cmath | 733 +++-
 libstdc++-v3/include/ext/type_traits.h  |  23 +-
 libstdc++-v3/include/std/atomic |  43 +-
 libstdc++-v3/include/std/charconv   |  90 ++-
 libstdc++-v3/include/std/complex| 227 +++---
 libstdc++-v3/include/std/istream|  61 +-
 libstdc++-v3/include/std/limits |  37 +-
 libstdc++-v3/include/std/numbers|  11 +
 libstdc++-v3/include/std/ostream|  29 +-
 libstdc++-v3/include/std/stdfloat   |  19 +-
 libstdc++-v3/include/std/type_traits|  23 +-
 15 files changed, 809 insertions(+), 557 deletions(-)

diff --git a/gcc/config/i386/i386-c.cc b/gcc/config/i386/i386-c.cc
index 2f83c9981e1..bcc17263e28 100644
--- a/gcc/config/i386/i386-c.cc
+++ b/gcc/config/i386/i386-c.cc
@@ -492,7 +492,14 @@ ix86_target_macros_internal (HOST_WIDE_INT isa_flag,
   if (isa_flag & OPTION_MASK_ISA_SSE)
 def_or_undef (parse_in, "__SSE__");
   if (isa_flag & OPTION_MASK_ISA_SSE2)
-def_or_undef (parse_in, "__SSE2__");
+{
+  def_or_undef (parse_in, "__SSE2__");
+  if (c_dialect_cxx () && cxx_dialect > cxx20)
+   {
+ def_or_undef (parse_in, "__STDCPP_FLOAT16_T__");
+ def_or_undef (parse_in, "__STDCPP_BFLOAT16_T__");
+   }
+}
   if (isa_flag & OPTION_MASK_ISA_SSE3)
 def_or_undef (parse_in, "__SSE3__");
   if (isa_flag & OPTION_MASK_ISA_SSSE3)
diff --git a/libstdc++-v3/include/bits/c++config 
b/libstdc++-v3/include/bits/c++config
index 13892787e09..c858497fc6e 100644
--- a/libstdc++-v3/include/bits/c++config
+++ b/libstdc++-v3/include/bits/c++config
@@ -820,6 +820,12 @@ namespace std
 # define _GLIBCXX_LDOUBLE_IS_IEEE_BINARY128 1
 #endif
 
+#ifndef __SSE2__
+#pragma GCC push_options
+#pragma GCC target("sse2")
+#define __DISABLE_STDCPP_SSE2__
+#endif
+
 #ifdef __STDCPP_BFLOAT16_T__
 namespace __gnu_cxx
 {
@@ -827,6 +833,11 @@ namespace __gnu_cxx
 }
 #endif
 
+#ifdef __DISABLE_STDCPP_SSE2__
+#undef __DISABLE_STDCPP_SSE2__
+#pragma GCC pop_options
+#endif
+
 #ifdef __has_builtin
 # ifdef __is_identifier
 // Intel and older Clang require !__is_identifier for some built-ins:
diff --git a/libstdc++-v3/include/bits/cpp_type_traits.h 
b/libstdc++-v3/include/bits/cpp_type_traits.h
index 4312f32a4e0..cadd5ca4fde 100644
--- a/libstdc++-v3/include/bits/cpp_type_traits.h
+++ b/libstdc++-v3/include/bits/cpp_type_traits.h
@@ -315,6 +315,12 @@ __INT_N(__GLIBCXX_TYPE_INT_N_3)
   typedef __true_type __type;
 };
 
+#ifndef __SSE2__
+#pragma GCC push_options
+#pragma GCC target("sse2")
+#define __DISABLE_STDCPP_SSE2__
+#endif
+
 #ifdef __STDCPP_FLOAT16_T__
   template<>
 struct __is_floating<_Float16>
@@ -324,36 +330,41 @@ __INT_N(__GLIBCXX_TYPE_INT_N_3)
 };
 #endif
 
-#ifdef __STDCPP_FLOAT32_T__

[PATCH] This replaces uses of last_stmt where we do not require debug skipping

2023-04-21 Thread Richard Biener via Gcc-patches
There are quite a few places that want to access the control stmt
ending a basic block.  Since there cannot be debug stmts after
such a stmt, there's no point in using last_stmt, which skips debug
stmts and can be a compile-time hog for larger testcases.
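
The replacement pattern, sketched (GCC internals; the wrapper function and
the empty branch are illustrative only):

  static void
  example (basic_block bb)
  {
    /* Before: walks backwards over debug stmts.  */
    gimple *old_way = last_stmt (bb);
    /* After: the control stmt, if present, is literally the last stmt.  */
    gimple *new_way = *gsi_last_bb (bb);
    (void) old_way; (void) new_way;
    /* Combined with safe_is_a to test for a specific statement kind:  */
    if (safe_is_a <gcond *> (*gsi_last_bb (bb)))
      {
        /* bb ends in a GIMPLE_COND.  */
      }
  }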

This is a first batch of changes, a second is to come next week.
It relies on the previously posted/pushed operator* support for
stmt iterators and safe_is_a for some of the replacements.

Bootstrapped and tested on x86_64-unknown-linux-gnu.

I'll leave this for comments in case there are any and will push
on Monday.

Richard.

* gimple-ssa-split-paths.cc (is_feasible_trace): Avoid
last_stmt.
* graphite-scop-detection.cc (single_pred_cond_non_loop_exit):
Likewise.
* ipa-fnsummary.cc (set_cond_stmt_execution_predicate): Likewise.
(set_switch_stmt_execution_predicate): Likewise.
(phi_result_unknown_predicate): Likewise.
* ipa-prop.cc (compute_complex_ancestor_jump_func): Likewise.
(ipa_analyze_indirect_call_uses): Likewise.
* predict.cc (predict_iv_comparison): Likewise.
(predict_extra_loop_exits): Likewise.
(predict_loops): Likewise.
(tree_predict_by_opcode): Likewise.
* gimple-predicate-analysis.cc (predicate::init_from_control_deps):
Likewise.
* gimple-pretty-print.cc (dump_implicit_edges): Likewise.
* tree-ssa-phiopt.cc (tree_ssa_phiopt_worker): Likewise.
(replace_phi_edge_with_variable): Likewise.
(two_value_replacement): Likewise.
(value_replacement): Likewise.
(minmax_replacement): Likewise.
(spaceship_replacement): Likewise.
(cond_removal_in_builtin_zero_pattern): Likewise.
* tree-ssa-reassoc.cc (maybe_optimize_range_tests): Likewise.
* tree-ssa-sccvn.cc (vn_phi_eq): Likewise.
(vn_phi_lookup): Likewise.
(vn_phi_insert): Likewise.
* tree-ssa-structalias.cc (compute_points_to_sets): Likewise.
* tree-ssa-threadbackwards.cc (back_threader::maybe_thread_block):
Likewise.
(back_threader_profitability::possibly_profitable_path_p):
Likewise.
* tree-ssa-threadedge.cc (jump_threader::thread_outgoing_edges):
Likewise.
* tree-switch-conversion.cc (pass_convert_switch::execute):
Likewise.
(pass_lower_switch::execute): Likewise.
* tree-tailcall.cc (tree_optimize_tail_calls_1): Likewise.
* tree-vect-loop-manip.cc (vect_loop_versioning): Likewise.
* tree-vect-slp.cc (vect_slp_function): Likewise.
* tree-vect-stmts.cc (cfun_returns): Likewise.
* tree-vectorizer.cc (vect_loop_vectorized_call): Likewise.
(vect_loop_dist_alias_call): Likewise.
---
 gcc/gimple-predicate-analysis.cc |  2 +-
 gcc/gimple-pretty-print.cc   |  5 +
 gcc/gimple-ssa-split-paths.cc|  5 ++---
 gcc/graphite-scop-detection.cc   |  9 ++---
 gcc/ipa-fnsummary.cc | 15 +--
 gcc/ipa-prop.cc  |  9 -
 gcc/predict.cc   | 25 +
 gcc/tree-ssa-phiopt.cc   | 32 ++--
 gcc/tree-ssa-reassoc.cc  |  2 +-
 gcc/tree-ssa-sccvn.cc|  8 
 gcc/tree-ssa-structalias.cc  |  2 +-
 gcc/tree-ssa-threadbackward.cc   |  4 ++--
 gcc/tree-ssa-threadedge.cc   |  4 +---
 gcc/tree-switch-conversion.cc|  9 +++--
 gcc/tree-tailcall.cc | 11 ++-
 gcc/tree-vect-loop-manip.cc  |  2 +-
 gcc/tree-vect-slp.cc |  2 +-
 gcc/tree-vect-stmts.cc   |  2 +-
 gcc/tree-vectorizer.cc   | 11 ---
 19 files changed, 59 insertions(+), 100 deletions(-)

diff --git a/gcc/gimple-predicate-analysis.cc b/gcc/gimple-predicate-analysis.cc
index 094e8c7aff3..c89a5b1653a 100644
--- a/gcc/gimple-predicate-analysis.cc
+++ b/gcc/gimple-predicate-analysis.cc
@@ -1830,7 +1830,7 @@ predicate::init_from_control_deps (const vec<edge> *dep_chains,
}
}
  /* Get the conditional controlling the bb exit edge.  */
- gimple *cond_stmt = last_stmt (guard_bb);
+ gimple *cond_stmt = *gsi_last_bb (guard_bb);
  if (gimple_code (cond_stmt) == GIMPLE_COND)
{
  /* The true edge corresponds to the uninteresting condition.
diff --git a/gcc/gimple-pretty-print.cc b/gcc/gimple-pretty-print.cc
index 300e9d7ed1e..e46f7d5f55a 100644
--- a/gcc/gimple-pretty-print.cc
+++ b/gcc/gimple-pretty-print.cc
@@ -3004,11 +3004,8 @@ dump_implicit_edges (pretty_printer *buffer, basic_block 
bb, int indent,
 dump_flags_t flags)
 {
   edge e;
-  gimple *stmt;
-
-  stmt = last_stmt (bb);
 
-  if (stmt && gimple_code (stmt) == GIMPLE_COND)
+  if (safe_is_a <gcond *> (*gsi_last_bb (bb)))
 {
   edge true_edge, false_edge;
 
diff --git a/gcc/gimple-ssa-split-paths.cc b/gcc/gimple-ssa-split-paths.cc
index 44d87c24f60..5896f4d9b64 100644
--- 

[PATCH] Add safe_is_a

2023-04-21 Thread Richard Biener via Gcc-patches
The following adds safe_is_a, an is_a check handling nullptr
gracefully.
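
Typical use, as in the companion last_stmt series (sketch; bb is assumed to
be a basic_block in scope):

  /* Without safe_is_a, a separate null check is needed:  */
  gimple *last = *gsi_last_bb (bb);
  if (last && is_a <gcond *> (last))
    ;
  /* With safe_is_a the null case folds into the predicate:  */
  if (safe_is_a <gcond *> (*gsi_last_bb (bb)))
    ;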

Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.

* is-a.h (safe_is_a): New.
---
 gcc/is-a.h | 13 +
 1 file changed, 13 insertions(+)

diff --git a/gcc/is-a.h b/gcc/is-a.h
index b5355242655..0a697cf002a 100644
--- a/gcc/is-a.h
+++ b/gcc/is-a.h
@@ -232,6 +232,19 @@ is_a (U *p)
   return is_a_helper::test (p);
 }
 
+/* Similar to is_a<>, but where the pointer can be NULL, even if
+   is_a_helper doesn't check for NULL.  */
+
+template 
+inline bool
+safe_is_a (U *p)
+{
+  if (p)
+return is_a_helper ::test (p);
+  else
+return false;
+}
+
 /* A generic conversion from a base type U to a derived type T.  See the
discussion above for when to use this function.  */
 
-- 
2.35.3


[PATCH] Add operator* to gimple_stmt_iterator and gphi_iterator

2023-04-21 Thread Richard Biener via Gcc-patches
This allows STL-style iterator dereferencing.  It's the same
as gsi_stmt () or .phi ().
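
Sketch of the resulting usage (GCC internals; loop bodies elided):

  for (gimple_stmt_iterator gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
       gsi_next (&gsi))
    {
      gimple *stmt = *gsi;   /* same as gsi_stmt (gsi) */
    }

  for (gphi_iterator pi = gsi_start_phis (bb); !gsi_end_p (pi);
       gsi_next (&pi))
    {
      gphi *phi = *pi;       /* same as pi.phi () */
    }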

Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.

* gimple-iterator.h (gimple_stmt_iterator::operator*): Add.
(gphi_iterator::operator*): Likewise.
---
 gcc/gimple-iterator.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/gcc/gimple-iterator.h b/gcc/gimple-iterator.h
index 38352aa95af..b709923f00d 100644
--- a/gcc/gimple-iterator.h
+++ b/gcc/gimple-iterator.h
@@ -24,6 +24,8 @@ along with GCC; see the file COPYING3.  If not see
 
 struct gimple_stmt_iterator
 {
+  gimple *operator * () const { return ptr; }
+
   /* Sequence node holding the current statement.  */
   gimple_seq_node ptr;
 
@@ -38,6 +40,8 @@ struct gimple_stmt_iterator
 /* Iterator over GIMPLE_PHI statements.  */
 struct gphi_iterator : public gimple_stmt_iterator
 {
+  gphi *operator * () const { return as_a <gphi *> (ptr); }
+
   gphi *phi () const
   {
    return as_a <gphi *> (ptr);
-- 
2.35.3


Fix loop-ch

2023-04-21 Thread Jan Hubicka via Gcc-patches
Hi,
Ondrej Kubanek implemented profiling of loop histograms, which should be
useful to improve e.g. the quality of loop peeling and vectorization.
However, it turns out that most histograms (about 90%) are lost on the way
from profiling to the loop peeling pass.  One common case is the following
latent bug in loop header copying, which forgets to update the loop header
pointer.

Curiously enough, making a single latch and preheader edge by splitting
basic blocks does work, but it works on the wrong edge.  As a consequence,
every loop whose header was copied is removed from the loop tree and
inserted again, losing all metadata.

The patch correctly updates the loop structure and also adds verification
that the loop tree is OK after all transforms, which fails without
the patch.

Bootstrapped/regtested x86_64-linux, plan to install this as obvious.

diff --git a/gcc/tree-ssa-loop-ch.cc b/gcc/tree-ssa-loop-ch.cc
index 6b332719411..2c56b3b3c31 100644
--- a/gcc/tree-ssa-loop-ch.cc
+++ b/gcc/tree-ssa-loop-ch.cc
@@ -557,6 +557,17 @@ ch_base::copy_headers (function *fun)
}
}
 
+  /* Update header of the loop.  */
+  loop->header = header;
+  /* Find correct latch.  We only duplicate chain of conditionals so
+ there should be precisely two edges to the new header.  One entry
+ edge and one to latch.  */
+  FOR_EACH_EDGE (e, ei, loop->header->preds)
+   if (header != e->src)
+ {
+   loop->latch = e->src;
+   break;
+ }
   /* Ensure that the latch and the preheader is simple (we know that they
 are not now, since there was the loop exit condition.  */
   split_edge (loop_preheader_edge (loop));
@@ -583,6 +594,8 @@ ch_base::copy_headers (function *fun)
 
   if (changed)
 {
+  if (flag_checking)
+   verify_loop_structure ();
   update_ssa (TODO_update_ssa);
   /* After updating SSA form perform CSE on the loop header
 copies.  This is esp. required for the pass before


Re: Re: [PATCH] RISC-V: Allow VMS{Compare} (V1, V1) shortcut optimization

2023-04-21 Thread Kito Cheng via Gcc-patches
Hi Pan:

One idea come to my mind, maybe we should add a new
define_insn_and_split pattern instead of change @pred_mov

On Fri, Apr 21, 2023 at 7:17 PM Li, Pan2 via Gcc-patches
 wrote:
>
> Thanks kito, will try to reproduce this issue and keep you posted.
>
> Pan
>
> -Original Message-
> From: Kito Cheng 
> Sent: Friday, April 21, 2023 6:17 PM
> To: Li, Pan2 
> Cc: juzhe.zh...@rivai.ai; gcc-patches ; Kito.cheng 
> ; Wang, Yanzhang 
> Subject: Re: Re: [PATCH] RISC-V: Allow VMS{Compare} (V1, V1) shortcut 
> optimization
>
> I got a bunch of new fails including ICE for gcc testsuite, and some cases 
> are hanging there, could you take a look?
>
> $ riscv64-unknown-linux-gnu-gcc
> gcc.target/riscv/rvv/vsetvl/avl_single-92.c -O2 -march=rv32gcv
> -mabi=ilp32
> during RTL pass: expand
> /scratch1/kitoc/riscv-gnu-workspace/riscv-gnu-toolchain-trunk/gcc/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/avl_single-92.c:
> In function 'f':
> /scratch1/kitoc/riscv-gnu-workspace/riscv-gnu-toolchain-trunk/gcc/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/avl_single-92.c:8:13:
> internal compiler error: in maybe_gen_insn, at optabs.cc:8102
> 8 |   vbool64_t mask = *(vbool64_t*) (in + 100);
>   | ^~~~
> 0x130d278 maybe_gen_insn(insn_code, unsigned int, expand_operand*)
> ../../../../riscv-gnu-toolchain-trunk/gcc/gcc/optabs.cc:8102
>
>
> On Fri, Apr 21, 2023 at 5:47 PM Li, Pan2 via Gcc-patches 
>  wrote:
> >
> > Kindly ping for the PATCH v2. Just FYI there will be some underlying 
> > investigation based on this PATCH like VMSEQ.
> >
> > Pan
> >
> > -Original Message-
> > From: Li, Pan2
> > Sent: Wednesday, April 19, 2023 7:27 PM
> > To: 'Kito Cheng' ; 'juzhe.zh...@rivai.ai'
> > 
> > Cc: 'gcc-patches' ; 'Kito.cheng'
> > ; Wang, Yanzhang 
> > Subject: RE: Re: [PATCH] RISC-V: Allow VMS{Compare} (V1, V1) shortcut
> > optimization
> >
> > Update the Patch v2 for more detail information for clarification. Please 
> > help to review continuously.
> >
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-April/616175.html
> >
> > Pan
> >
> > -Original Message-
> > From: Li, Pan2
> > Sent: Wednesday, April 19, 2023 6:33 PM
> > To: Kito Cheng ; juzhe.zh...@rivai.ai
> > Cc: gcc-patches ; Kito.cheng
> > ; Wang, Yanzhang 
> > Subject: RE: Re: [PATCH] RISC-V: Allow VMS{Compare} (V1, V1) shortcut
> > optimization
> >
> > Sure thing.
> >
> > For Changlog, I consider it was generated automatically in previous. LOL.
> >
> > Pan
> >
> > -Original Message-
> > From: Kito Cheng 
> > Sent: Wednesday, April 19, 2023 5:46 PM
> > To: juzhe.zh...@rivai.ai
> > Cc: Li, Pan2 ; gcc-patches
> > ; Kito.cheng ; Wang,
> > Yanzhang 
> > Subject: Re: Re: [PATCH] RISC-V: Allow VMS{Compare} (V1, V1) shortcut
> > optimization
> >
> > HI JuZhe:
> >
> > Thanks for explaining!
> >
> >
> > Hi Pan:
> >
> > I think that would be helpful if JuZhe's explaining that could be written 
> > into the commit log.
> >
> >
> > > gcc/ChangeLog:
> > >
> > >* config/riscv/riscv-v.cc (emit_pred_op):
> > >* config/riscv/riscv-vector-builtins-bases.cc:
> > >* config/riscv/vector.md:
> >
> > And don't forgot write some thing in ChangeLog...:P


[PATCH v4 1/4] rs6000: Enable REE pass by default

2023-04-21 Thread Ajit Agarwal via Gcc-patches
Hello All:

This patch enables the REE pass by default at -O2 and above.
Bootstrapped and regtested on powerpc64-linux-gnu.

Thanks & Regards
Ajit

rs6000: Enable REE pass by default

Add the ree pass as a default pass for the rs6000 target at
-O2 and above.
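
To confirm the pass now runs (illustrative commands; any C input works):

  gcc -O2 -fdump-rtl-ree -S test.c   # ree dump is produced by default
  gcc -O2 -fno-ree -S test.c         # opt back out explicitly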

2023-04-21  Ajit Kumar Agarwal  

gcc/ChangeLog:

* common/config/rs6000/rs6000-common.cc: Add REE pass as a
default rs6000 target pass for O2 and above.
---
 gcc/common/config/rs6000/rs6000-common.cc | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/gcc/common/config/rs6000/rs6000-common.cc 
b/gcc/common/config/rs6000/rs6000-common.cc
index 2140c442ba9..968db215028 100644
--- a/gcc/common/config/rs6000/rs6000-common.cc
+++ b/gcc/common/config/rs6000/rs6000-common.cc
@@ -34,6 +34,8 @@ static const struct default_options 
rs6000_option_optimization_table[] =
 { OPT_LEVELS_ALL, OPT_fsplit_wide_types_early, NULL, 1 },
 /* Enable -fsched-pressure for first pass instruction scheduling.  */
 { OPT_LEVELS_1_PLUS, OPT_fsched_pressure, NULL, 1 },
+/* Enable -free for zero extension and sign extension elimination.  */
+{ OPT_LEVELS_2_PLUS, OPT_free, NULL, 1 },
 /* Enable -munroll-only-small-loops with -funroll-loops to unroll small
loops at -O2 and above by default.  */
 { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_funroll_loops, NULL, 1 },
-- 
2.31.1



Re: [PATCH] match.pd: Fix fneg/fadd optimization [PR109583]

2023-04-21 Thread Richard Biener via Gcc-patches
On Fri, 21 Apr 2023, Jakub Jelinek wrote:

> Hi!
> 
> The following testcase ICEs on x86, foo function since my r14-22
> improvement, but bar already since r13-4122.  The problem is the same,
> in the if expression related_vector_mode is called and that starts with
>   gcc_assert (VECTOR_MODE_P (vector_mode));
> but nothing in the fneg/fadd match.pd pattern actually checks if the
> VEC_PERM type has VECTOR_MODE_P (vec_mode).  In this case it has BLKmode
> and so it ICEs.
> 
> The following patch makes sure we don't ICE on it.
> Ok for trunk and 13.1 (it is a 13/14 Regression and I think the fix
> is quite obvious and safe) if it passes bootstrap/regtest?

OK for both.

Richard.

> 2023-04-21  Jakub Jelinek  
> 
>   PR tree-optimization/109583
>   * match.pd (fneg/fadd simplify): Don't call related_vector_mode
>   if vec_mode is not VECTOR_MODE_P.
> 
>   * gcc.dg/pr109583.c: New test.
> 
> --- gcc/match.pd.jj   2023-04-18 11:01:38.867871375 +0200
> +++ gcc/match.pd  2023-04-21 13:26:01.250166206 +0200
> @@ -8103,7 +8103,8 @@ and,
>   poly_uint64 wide_nunits;
>   scalar_mode inner_mode = GET_MODE_INNER (vec_mode);
>}
> -  (if (sel.series_p (0, 2, 0, 2)
> +  (if (VECTOR_MODE_P (vec_mode)
> +&& sel.series_p (0, 2, 0, 2)
>  && sel.series_p (1, 2, nelts + 1, 2)
>  && GET_MODE_2XWIDER_MODE (inner_mode).exists (&wide_elt_mode)
>  && multiple_p (GET_MODE_NUNITS (vec_mode), 2, &wide_nunits)
> --- gcc/testsuite/gcc.dg/pr109583.c.jj2023-04-21 13:28:36.462911138 
> +0200
> +++ gcc/testsuite/gcc.dg/pr109583.c   2023-04-21 13:28:06.746342736 +0200
> @@ -0,0 +1,25 @@
> +/* PR tree-optimization/109583 */
> +/* { dg-do compile } */
> +/* { dg-options "-O1 -Wno-psabi" } */
> +/* { dg-additional-options "-mno-avx" { target i?86-*-* x86_64-*-* } } */
> +
> +typedef float v8sf __attribute__((vector_size (8 * sizeof (float))));
> +typedef int v8si __attribute__((vector_size (8 * sizeof (int))));
> +
> +#if __SIZEOF_INT__ == __SIZEOF_FLOAT__
> +v8sf
> +foo (v8sf x, v8sf y)
> +{
> +  v8sf a = x - y;
> +  v8sf b = x + y;
> +  return __builtin_shuffle (a, b, (v8si) { 0, 9, 2, 11, 4, 13, 6, 15 });
> +}
> +
> +v8sf
> +bar (v8sf x, v8sf y)
> +{
> +  v8sf a = x + y;
> +  v8sf b = x - y;
> +  return __builtin_shuffle (a, b, (v8si) { 0, 9, 2, 11, 4, 13, 6, 15 });
> +}
> +#endif
> 
>   Jakub
> 
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)


Stabilize inliner Fibonacci heap

2023-04-21 Thread Jan Hubicka via Gcc-patches
Hi,
This fixes another problem Michal noticed while working on incremental
WHOPR.  The Fibonacci heap can change its behaviour quite significantly,
for no good reason, when multiple edges with the same key occur.  This is
quite common for small functions.

This patch stabilizes the order by adding edge uids into the info.
Again I think this is a good idea regardless of the incremental WPA
project, since we do not want random changes in inline decisions.
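
The tie-breaking idea in isolation (self-contained sketch, not the GCC
code): compare by the primary key first and fall back to a stable id, so
heap order never depends on insertion order or pointer values.

  struct badness_key
  {
    double badness;
    int uid;                      /* stable per-edge id */
    bool operator< (const badness_key &o) const
    {
      if (badness != o.badness)
        return badness < o.badness;
      return uid < o.uid;         /* deterministic for equal badness */
    }
  };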

Bootstrapped/regtested x86_64-linux, plan to commit it shortly.
gcc/ChangeLog:

* ipa-inline.cc (class inline_badness): New class.
(edge_heap_t, edge_heap_node_t): Use inline_badness for badness instead
of sreal.
(update_edge_key): Update.
(lookup_recursive_calls): Likewise.
(recursive_inlining): Likewise.
(add_new_edges_to_heap): Likewise.
(inline_small_functions): Likewise.

diff --git a/gcc/ipa-inline.cc b/gcc/ipa-inline.cc
index 474fbff2057..efc8df7d4e0 100644
--- a/gcc/ipa-inline.cc
+++ b/gcc/ipa-inline.cc
@@ -120,8 +120,54 @@ along with GCC; see the file COPYING3.  If not see
 #include "attribs.h"
 #include "asan.h"
 
-typedef fibonacci_heap <sreal, cgraph_edge> edge_heap_t;
-typedef fibonacci_node <sreal, cgraph_edge> edge_heap_node_t;
+/* Inliner uses greedy algorithm to inline calls in a priority order.
+   Badness is used as the key in a Fibonacci heap which roughly corresponds
+   to negation of benefit to cost ratios.
+   In case multiple calls have the same priority we want to stabilize
+   the outcome, for which we use ids.  */
+class inline_badness
+{
+public:
+  sreal badness;
+  int uid;
+  inline_badness ()
+  : badness (sreal::min ()), uid (0)
+  {
+  }
+  inline_badness (cgraph_edge *e, sreal b)
+  : badness (b), uid (e->get_uid ())
+  {
+  }
+  bool operator<= (const inline_badness &other)
+  {
+if (badness != other.badness)
+  return badness <= other.badness;
+return uid <= other.uid;
+  }
+  bool operator== (const inline_badness &other)
+  {
+return badness == other.badness && uid == other.uid;
+  }
+  bool operator!= (const inline_badness &other)
+  {
+return badness != other.badness || uid != other.uid;
+  }
+  bool operator< (const inline_badness &other)
+  {
+if (badness != other.badness)
+  return badness < other.badness;
+return uid < other.uid;
+  }
+  bool operator> (const inline_badness &other)
+  {
+if (badness != other.badness)
+  return badness > other.badness;
+return uid > other.uid;
+  }
+};
+
+typedef fibonacci_heap <inline_badness, cgraph_edge> edge_heap_t;
+typedef fibonacci_node <inline_badness, cgraph_edge> edge_heap_node_t;
 
 /* Statistics we collect about inlining algorithm.  */
 static int overall_size;
@@ -1399,7 +1445,7 @@ update_edge_key (edge_heap_t *heap, struct cgraph_edge 
*edge)
 We do lazy increases: after extracting minimum if the key
 turns out to be out of date, it is re-inserted into heap
 with correct value.  */
-  if (badness < n->get_key ())
+  if (badness < n->get_key ().badness)
{
  if (dump_file && (dump_flags & TDF_DETAILS))
{
@@ -1407,10 +1453,11 @@ update_edge_key (edge_heap_t *heap, struct cgraph_edge 
*edge)
   "  decreasing badness %s -> %s, %f to %f\n",
   edge->caller->dump_name (),
   edge->callee->dump_name (),
-  n->get_key ().to_double (),
+  n->get_key ().badness.to_double (),
   badness.to_double ());
}
- heap->decrease_key (n, badness);
+ inline_badness b (edge, badness);
+ heap->decrease_key (n, b);
}
 }
   else
@@ -1423,7 +1470,8 @@ update_edge_key (edge_heap_t *heap, struct cgraph_edge 
*edge)
edge->callee->dump_name (),
badness.to_double ());
 }
-  edge->aux = heap->insert (badness, edge);
+  inline_badness b (edge, badness);
+  edge->aux = heap->insert (b, edge);
 }
 }
 
@@ -1630,7 +1678,10 @@ lookup_recursive_calls (struct cgraph_node *node, struct 
cgraph_node *where,
 if (e->callee == node
|| (e->callee->ultimate_alias_target (&avail, e->caller) == node
&& avail > AVAIL_INTERPOSABLE))
-  heap->insert (-e->sreal_frequency (), e);
+{
+  inline_badness b (e, -e->sreal_frequency ());
+  heap->insert (b, e);
+}
   for (e = where->callees; e; e = e->next_callee)
 if (!e->inline_failed)
   lookup_recursive_calls (node, e->callee, heap);
@@ -1649,7 +1700,8 @@ recursive_inlining (struct cgraph_edge *edge,
  ? edge->caller->inlined_to : edge->caller);
   int limit = opt_for_fn (to->decl,
  param_max_inline_insns_recursive_auto);
-  edge_heap_t heap (sreal::min ());
+  inline_badness b (edge, sreal::min ());
+  edge_heap_t heap (b);
   struct cgraph_node *node;
   struct cgraph_edge *e;
   struct cgraph_node *master_clone = NULL, *next;
@@ -1809,7 +1861,10 @@ add_new_edges_to_heap (edge_heap_t *heap, 
vec<cgraph_edge *> new_edges)
  && 

[PATCH] match.pd: Fix fneg/fadd optimization [PR109583]

2023-04-21 Thread Jakub Jelinek via Gcc-patches
Hi!

The following testcase ICEs on x86, foo function since my r14-22
improvement, but bar already since r13-4122.  The problem is the same,
in the if expression related_vector_mode is called and that starts with
  gcc_assert (VECTOR_MODE_P (vector_mode));
but nothing in the fneg/fadd match.pd pattern actually checks if the
VEC_PERM type has VECTOR_MODE_P (vec_mode).  In this case it has BLKmode
and so it ICEs.

The following patch makes sure we don't ICE on it.
Ok for trunk and 13.1 (it is a 13/14 Regression and I think the fix
is quite obvious and safe) if it passes bootstrap/regtest?

2023-04-21  Jakub Jelinek  

PR tree-optimization/109583
* match.pd (fneg/fadd simplify): Don't call related_vector_mode
if vec_mode is not VECTOR_MODE_P.

* gcc.dg/pr109583.c: New test.

--- gcc/match.pd.jj 2023-04-18 11:01:38.867871375 +0200
+++ gcc/match.pd2023-04-21 13:26:01.250166206 +0200
@@ -8103,7 +8103,8 @@ and,
poly_uint64 wide_nunits;
scalar_mode inner_mode = GET_MODE_INNER (vec_mode);
   }
-  (if (sel.series_p (0, 2, 0, 2)
+  (if (VECTOR_MODE_P (vec_mode)
+  && sel.series_p (0, 2, 0, 2)
   && sel.series_p (1, 2, nelts + 1, 2)
   && GET_MODE_2XWIDER_MODE (inner_mode).exists (&wide_elt_mode)
   && multiple_p (GET_MODE_NUNITS (vec_mode), 2, &wide_nunits)
--- gcc/testsuite/gcc.dg/pr109583.c.jj  2023-04-21 13:28:36.462911138 +0200
+++ gcc/testsuite/gcc.dg/pr109583.c 2023-04-21 13:28:06.746342736 +0200
@@ -0,0 +1,25 @@
+/* PR tree-optimization/109583 */
+/* { dg-do compile } */
+/* { dg-options "-O1 -Wno-psabi" } */
+/* { dg-additional-options "-mno-avx" { target i?86-*-* x86_64-*-* } } */
+
+typedef float v8sf __attribute__((vector_size (8 * sizeof (float))));
+typedef int v8si __attribute__((vector_size (8 * sizeof (int))));
+
+#if __SIZEOF_INT__ == __SIZEOF_FLOAT__
+v8sf
+foo (v8sf x, v8sf y)
+{
+  v8sf a = x - y;
+  v8sf b = x + y;
+  return __builtin_shuffle (a, b, (v8si) { 0, 9, 2, 11, 4, 13, 6, 15 });
+}
+
+v8sf
+bar (v8sf x, v8sf y)
+{
+  v8sf a = x + y;
+  v8sf b = x - y;
+  return __builtin_shuffle (a, b, (v8si) { 0, 9, 2, 11, 4, 13, 6, 15 });
+}
+#endif

Jakub



Stabilize temporary variable names

2023-04-21 Thread Jan Hubicka via Gcc-patches
Hi,
Michal Jires implemented a quite well working prototype of a cache for WPA
which re-uses partitions from an earlier build when a package is rebuilt
with smaller changes.  It should be useful to improve edit/compile/debug
cycles when one is forced to debug with LTO enabled, but hopefully also to
avoid duplicated work, e.g. during bootstrap where libbackend is linked
into multiple frontends.

To make this work well, it is necessary to avoid local decisions from
one function leaking into others.  This patch fixes a problem he noticed
in create_tmp_var_name.  It appends one global id after each name,
so calling create_tmp_var_name ("pretmp") yields something
like "pretmp.1".  This global counter makes one function body depend
on the number of temporaries produced by another function body.  In his
testcase a local change to a large function in the switch conversion pass
resulted in recompilation of most partitions, since an important
inline (unrelated to the patched function) had an ID in it that changed.

I think that independently of the incremental WPA project it is a good
idea to stabilize temporary names, in a similar manner as we stabilized
symbol names of clones a couple of years back.

I think we want
 1) for local variables, use local IDs
    (this is used by the gimplifier to produce temporaries, OpenMP lowering,
    thunk generation and by some passes to make names instead of unnamed
    SSA names);
 2) for global variables, append the function symbol name when the variable
    exists to lower some construct in that function's body.
    For example switch conversion can use CSWITCH.foo
    instead of CSWITCH.
 3) for truly global variables, have per-name IDs.

This patch implements only 1) but adds an extra parameter separating locals
from globals, as sketched below.
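
The idea in isolation (self-contained sketch, not the GCC implementation;
function_ctx stands in for the new tmp_var_id_num field on struct
function):

  #include <string>

  struct function_ctx
  {
    int tmp_var_id_num = 0;        /* per-function counter */
  };
  static int global_id;            /* the old, cross-function counter */

  std::string
  make_tmp_name (function_ctx &f, const char *prefix, bool is_global)
  {
    int id = is_global ? global_id++ : f.tmp_var_id_num++;
    return std::string (prefix) + "." + std::to_string (id);
  }

With this, "pretmp.1" in one function no longer depends on how many
temporaries other functions created.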

Bootstrapped/regtested x86_64-linux, OK?

gcc/fortran/ChangeLog:

* trans-array.cc (gfc_build_constant_array_constructor): Update call of
create_tmp_var_name.
(gfc_trans_auto_array_allocation): Likewise.
* trans-decl.cc (gfc_add_assign_aux_vars): Likewise.
(create_function_arglist): Likewise.
(generate_coarray_init): Likewise.
(create_main_function): Likewise.

gcc/ChangeLog:

* function.h (struct function): Add tmp_var_id_num.
* gimple-expr.cc (create_tmp_var_name): Add parameter whether name is
local or global; for local use function local ids.
(create_tmp_var_raw): Update.
* gimple-expr.h (create_tmp_var_name): Update.
* gimplify.cc (internal_get_tmp_var): Update.
(gimplify_init_constructor): Update.
(gimplify_modify_expr): Update.
(gimplify_function_tree): Update.
* ipa-param-manipulation.cc 
(ipa_param_body_adjustments::common_initialization): Update.
* lto-streamer-in.cc (input_struct_function_base): Update.
* lto-streamer-out.cc (hash_tree): Update.
* omp-low.cc (scan_omp_parallel): Update.
(scan_omp_task): Update.
(scan_omp_single): Update.
(scan_omp_target): Update.
(scan_omp_teams): Update.
* omp-oacc-neuter-broadcast.cc (oacc_do_neutering): Update.
* symtab-thunks.cc (expand_thunk): Update.
* tree-nested.cc (get_chain_decl): Update.
* tree-parloops.cc (separate_decls_in_region): Update.
* tree-switch-conversion.cc (switch_conversion::build_one_array): Update.


gcc/testsuite/ChangeLog:

* gfortran.dg/char_cast_1.f90: Update template.
* gfortran.dg/vector_subscript_4.f90: Update template.

diff --git a/gcc/fortran/trans-array.cc b/gcc/fortran/trans-array.cc
index e1725808033..f3de515f6c9 100644
--- a/gcc/fortran/trans-array.cc
+++ b/gcc/fortran/trans-array.cc
@@ -2623,7 +2623,7 @@ gfc_build_constant_array_constructor (gfc_expr * expr, 
tree type)
   TREE_CONSTANT (init) = 1;
   TREE_STATIC (init) = 1;
 
-  tmp = build_decl (input_location, VAR_DECL, create_tmp_var_name ("A"),
+  tmp = build_decl (input_location, VAR_DECL, create_tmp_var_name ("A", true),
tmptype);
   DECL_ARTIFICIAL (tmp) = 1;
   DECL_IGNORED_P (tmp) = 1;
@@ -6699,7 +6699,7 @@ gfc_trans_auto_array_allocation (tree decl, gfc_symbol * 
sym,
 {
   gcc_assert (TREE_CODE (TREE_TYPE (decl)) == POINTER_TYPE);
   space = build_decl (gfc_get_location (>declared_at),
- VAR_DECL, create_tmp_var_name ("A"),
+ VAR_DECL, create_tmp_var_name ("A", false),
  TREE_TYPE (TREE_TYPE (decl)));
   gfc_trans_vla_type_sizes (sym, );
 }
diff --git a/gcc/fortran/trans-decl.cc b/gcc/fortran/trans-decl.cc
index 299764b08b2..4c59e4dc0f4 100644
--- a/gcc/fortran/trans-decl.cc
+++ b/gcc/fortran/trans-decl.cc
@@ -1417,10 +1417,10 @@ gfc_add_assign_aux_vars (gfc_symbol * sym)
   gfc_allocate_lang_decl (decl);
   GFC_DECL_ASSIGN (decl) = 1;
   length = build_decl (input_location,
-  VAR_DECL, create_tmp_var_name (sym->name),
+  VAR_DECL, 

[PATCH] tree-optimization/109573 - avoid ICEing on unexpected live def

2023-04-21 Thread Richard Biener via Gcc-patches
The following relaxes the assert in vectorizable_live_operation
where we catch currently unhandled cases, allowing an
intermediate copy as happens here, and also demotes the assert
to checking only.

Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.

PR tree-optimization/109573
* tree-vect-loop.cc (vectorizable_live_operation): Allow
unhandled SSA copy as well.  Demote assert to checking only.

* g++.dg/vect/pr109573.cc: New testcase.
---
 gcc/testsuite/g++.dg/vect/pr109573.cc | 91 +++
 gcc/tree-vect-loop.cc |  7 ++-
 2 files changed, 95 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/g++.dg/vect/pr109573.cc

diff --git a/gcc/testsuite/g++.dg/vect/pr109573.cc 
b/gcc/testsuite/g++.dg/vect/pr109573.cc
new file mode 100644
index 000..d96f86f9579
--- /dev/null
+++ b/gcc/testsuite/g++.dg/vect/pr109573.cc
@@ -0,0 +1,91 @@
+// { dg-do compile }
+// { dg-require-effective-target c++20 }
+
+void *operator new(__SIZE_TYPE__, void *__p) { return __p; }
+template <typename _Head> struct _Head_base {
+  _Head _M_head_impl;
+};
+template <unsigned long, typename...> struct _Tuple_impl;
+template <unsigned long _Idx, typename _Head, typename... _Tail>
+struct _Tuple_impl<_Idx, _Head, _Tail...> : _Tuple_impl<_Idx + 1, _Tail...>,
+                                            _Head_base<_Head> {
+  template <typename _UHead, typename... _UTail>
+  _Tuple_impl(_UHead __head, _UTail... __tail)
+      : _Tuple_impl<_Idx + 1, _Tail...>(__tail...), _Head_base<_Head>(__head) {}
+};
+template <unsigned long _Idx, typename _Head> struct _Tuple_impl<_Idx, _Head> {
+  template <typename _UHead> _Tuple_impl(_UHead);
+};
+template <typename... _Elements> struct tuple : _Tuple_impl<0, _Elements...> {
+  template <typename... _UElements>
+  tuple(_UElements... __elements)
+      : _Tuple_impl<0, _Elements...>(__elements...) {}
+};
+unsigned long position_;
+struct Zone {
+  template <typename T, typename... Args> T *New(Args... args) {
+    return new (reinterpret_cast<void *>(position_)) T(args...);
+  }
+};
+struct Label {
+  int pos_;
+  int near_link_pos_;
+};
+enum Condition { below_equal };
+void bind(Label *);
+Zone *zone();
+unsigned long deopt_info_address();
+int MakeDeferredCode___trans_tmp_2, MakeDeferredCode___trans_tmp_3,
+    Prologue___trans_tmp_6, MakeDeferredCode___trans_tmp_1;
+struct MaglevAssembler {
+  template <typename Function, typename... Args>
+  void MakeDeferredCode(Function &&, Args &&...);
+  template <typename Function, typename... Args>
+  void JumpToDeferredIf(Condition, Function, Args... args) {
+    MakeDeferredCode(Function(), args...);
+  }
+  void Prologue();
+};
+struct ZoneLabelRef {
+  ZoneLabelRef(Zone *zone) : label_(zone->New<Label>()) {}
+  ZoneLabelRef(MaglevAssembler *) : ZoneLabelRef(zone()) {}
+  Label *operator*() { return label_; }
+  Label *label_;
+};
+template <typename Function>
+struct FunctionArgumentsTupleHelper
+    : FunctionArgumentsTupleHelper<decltype(&Function::operator())> {};
+template <typename C, typename R, typename... A>
+struct FunctionArgumentsTupleHelper<R (C::*)(A...) const> {
+  using Tuple = tuple<A...>;
+};
+template <typename T> struct StripFirstTupleArg;
+template <typename T1, typename... T>
+struct StripFirstTupleArg<tuple<T1, T...>> {
+  using Stripped = tuple<T...>;
+};
+template <typename Function> struct DeferredCodeInfoImpl {
+  template <typename... InArgs>
+  DeferredCodeInfoImpl(int *, int, int, Function, InArgs... args)
+      : args(args...) {}
+  StripFirstTupleArg<
+      typename FunctionArgumentsTupleHelper<Function>::Tuple>::Stripped args;
+};
+template <typename Function, typename... Args>
+void MaglevAssembler::MakeDeferredCode(Function &&deferred_code_gen,
+                                       Args &&...args) {
+  zone()->New<DeferredCodeInfoImpl<Function>>(
+      &MakeDeferredCode___trans_tmp_1, MakeDeferredCode___trans_tmp_2,
+      MakeDeferredCode___trans_tmp_3, deferred_code_gen, args...);
+}
+void MaglevAssembler::Prologue() {
+  int *__trans_tmp_9;
+  ZoneLabelRef deferred_call_stack_guard_return(this);
+  __trans_tmp_9 = reinterpret_cast<int *>(deopt_info_address());
+  JumpToDeferredIf(
+      below_equal, [](MaglevAssembler, int *, ZoneLabelRef, int, int) {},
+      __trans_tmp_9, deferred_call_stack_guard_return, Prologue___trans_tmp_6,
+      0);
+  Label __trans_tmp_7 = **deferred_call_stack_guard_return;
+  bind(&__trans_tmp_7);
+}
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index ba28214f09a..6ea0f21fd13 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -10114,9 +10114,10 @@ vectorizable_live_operation (vec_info *vinfo,
use_stmt))
  {
enum tree_code code = gimple_assign_rhs_code (use_stmt);
-   gcc_assert (code == CONSTRUCTOR
-   || code == VIEW_CONVERT_EXPR
-   || CONVERT_EXPR_CODE_P (code));
+   gcc_checking_assert (code == SSA_NAME
+|| code == CONSTRUCTOR
+|| code == VIEW_CONVERT_EXPR
+|| CONVERT_EXPR_CODE_P (code));
if (dump_enabled_p ())
  dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
   "Using original scalar computation for "
-- 
2.35.3


Remove dead handling of label_decl in tree merging

2023-04-21 Thread Jan Hubicka via Gcc-patches
Hi,
while working on incremental WHOPR with Michal Jires, we noticed that
there is code hashing LABEL_DECL_UID in lto-streamer-out which would
break the hash table, since label decls are not streamed and get
re-initialized later.

The whole conditional is dead since LABEL_DECLs are not merged across
TUs.

LTO bootstrapped/regtested on x86_64-linux, plan to commit it shortly.

gcc/ChangeLog:

* lto-streamer-out.cc (hash_tree): Remove dead handling of LABEL_DECL.

gcc/lto/ChangeLog:

* lto-common.cc (compare_tree_sccs_1): Remove dead handling of 
LABEL_DECL.

diff --git a/gcc/lto-streamer-out.cc b/gcc/lto-streamer-out.cc
index 0bca530313c..27f9edd3fcd 100644
--- a/gcc/lto-streamer-out.cc
+++ b/gcc/lto-streamer-out.cc
@@ -1268,12 +1268,8 @@ hash_tree (struct streamer_tree_cache_d *cache, 
hash_map *map,
   hstate.add_flag (DECL_NOT_GIMPLE_REG_P (t));
   hstate.commit_flag ();
   hstate.add_int (DECL_ALIGN (t));
-  if (code == LABEL_DECL)
-   {
-  hstate.add_int (EH_LANDING_PAD_NR (t));
- hstate.add_int (LABEL_DECL_UID (t));
-   }
-  else if (code == FIELD_DECL)
+  gcc_checking_assert (code != LABEL_DECL);
+  if (code == FIELD_DECL)
{
  hstate.add_flag (DECL_PACKED (t));
  hstate.add_flag (DECL_NONADDRESSABLE_P (t));
diff --git a/gcc/lto/lto-common.cc b/gcc/lto/lto-common.cc
index 882dd8971a4..597fc5dbabf 100644
--- a/gcc/lto/lto-common.cc
+++ b/gcc/lto/lto-common.cc
@@ -1180,12 +1180,8 @@ compare_tree_sccs_1 (tree t1, tree t2, tree **map)
   compare_values (DECL_EXTERNAL);
   compare_values (DECL_NOT_GIMPLE_REG_P);
   compare_values (DECL_ALIGN);
-  if (code == LABEL_DECL)
-   {
- compare_values (EH_LANDING_PAD_NR);
- compare_values (LABEL_DECL_UID);
-   }
-  else if (code == FIELD_DECL)
+  gcc_checking_assert (code != LABEL_DECL);
+  if (code == FIELD_DECL)
{
  compare_values (DECL_PACKED);
  compare_values (DECL_NONADDRESSABLE_P);


[PATCH 3/3] Use correct CFG orders for DF worklist processing

2023-04-21 Thread Richard Biener via Gcc-patches
This adjusts the remaining three RPO computations in DF.  The DF_FORWARD
problems should use an RPO on the forward graph, and the DF_BACKWARD
problems an RPO on the inverted graph.
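
As a quick hand-worked illustration (mine, not from the patch): for the
diamond CFG

  E -> A,  A -> B,  A -> C,  B -> D,  C -> D,  D -> X

an RPO on the forward graph, e.g. E A B C D X, visits each block before
all of its successors, so a forward dataflow sweep sees up-to-date inputs
on the first pass; an RPO on the inverted graph, e.g. X D B C A E, visits
each block before all of its predecessors, which is what a backward
sweep wants.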

Conveniently, inverted_rev_post_order_compute now computes an RPO.
We still use post_order_compute and reverse its order for its
side effect of deleting unreachable blocks.

This results in an overall reduction in visited blocks on cc1 files by 5.2%.

Because most regions are irreducible on the reverse CFG, there are few
cases where the number of visited blocks increases.  For the set
of cc1 files I have, this happens for et-forest.i, graph.i, hwint.i,
tree-ssa-dom.i, tree-ssa-loop-ch.i and tree-ssa-threadedge.i.  For
tree-ssa-dse.i it's above the noise; I've investigated more closely
and figured it is really bad luck due to the irreducibility.

Bootstrapped and tested on x86_64-unknown-linux-gnu and the series pushed.

* df-core.cc (df_analyze): Compute RPO on the reverse graph
for DF_BACKWARD problems.
(loop_post_order_compute): Rename to ...
(loop_rev_post_order_compute): ... this, compute a RPO.
(loop_inverted_post_order_compute): Rename to ...
(loop_inverted_rev_post_order_compute): ... this, compute a RPO.
(df_analyze_loop): Use RPO on the forward graph for DF_FORWARD
problems, RPO on the inverted graph for DF_BACKWARD.
---
 gcc/df-core.cc | 36 
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/gcc/df-core.cc b/gcc/df-core.cc
index 27123645da5..d4812b04a7c 100644
--- a/gcc/df-core.cc
+++ b/gcc/df-core.cc
@@ -1259,14 +1259,18 @@ df_analyze (void)
 
   free (df->postorder);
   free (df->postorder_inverted);
-  df->postorder = XNEWVEC (int, last_basic_block_for_fn (cfun));
-  df->n_blocks = post_order_compute (df->postorder, true, true);
   /* For DF_FORWARD use a RPO on the forward graph.  Since we want to
  have unreachable blocks deleted use post_order_compute and reverse
  the order.  */
   df->postorder_inverted = XNEWVEC (int, n_basic_blocks_for_fn (cfun));
-  for (int i = 0; i < df->n_blocks; ++i)
-df->postorder_inverted[i] = df->postorder[df->n_blocks - 1 - i];
+  df->n_blocks = post_order_compute (df->postorder_inverted, true, true);
+  for (int i = 0; i < df->n_blocks / 2; ++i)
+std::swap (df->postorder_inverted[i],
+  df->postorder_inverted[df->n_blocks - 1 - i]);
+  /* For DF_BACKWARD use a RPO on the reverse graph.  */
+  df->postorder = XNEWVEC (int, n_basic_blocks_for_fn (cfun));
+  int n = inverted_rev_post_order_compute (cfun, df->postorder);
+  gcc_assert (n == df->n_blocks);
 
   for (int i = 0; i < df->n_blocks; i++)
 bitmap_set_bit (current_all_blocks, df->postorder[i]);
@@ -1305,11 +1309,11 @@ df_analyze (void)
Returns the number of blocks which is always loop->num_nodes.  */
 
 static int
-loop_post_order_compute (int *post_order, class loop *loop)
+loop_rev_post_order_compute (int *post_order, class loop *loop)
 {
   edge_iterator *stack;
   int sp;
-  int post_order_num = 0;
+  int post_order_num = loop->num_nodes - 1;
 
   /* Allocate stack for back-tracking up CFG.  */
   stack = XNEWVEC (edge_iterator, loop->num_nodes + 1);
@@ -1342,13 +1346,13 @@ loop_post_order_compute (int *post_order, class loop 
*loop)
   time, check its successors.  */
stack[sp++] = ei_start (dest->succs);
  else
-   post_order[post_order_num++] = dest->index;
+   post_order[post_order_num--] = dest->index;
}
   else
{
  if (ei_one_before_end_p (ei)
  && src != loop_preheader_edge (loop)->src)
-   post_order[post_order_num++] = src->index;
+   post_order[post_order_num--] = src->index;
 
  if (!ei_one_before_end_p (ei))
   ei_next (&stack[sp - 1]);
@@ -1359,19 +1363,19 @@ loop_post_order_compute (int *post_order, class loop 
*loop)
 
   free (stack);
 
-  return post_order_num;
+  return loop->num_nodes;
 }
 
 /* Compute the reverse top sort order of the inverted sub-CFG specified
by LOOP.  Returns the number of blocks which is always loop->num_nodes.  */
 
 static int
-loop_inverted_post_order_compute (int *post_order, class loop *loop)
+loop_inverted_rev_post_order_compute (int *post_order, class loop *loop)
 {
   basic_block bb;
   edge_iterator *stack;
   int sp;
-  int post_order_num = 0;
+  int post_order_num = loop->num_nodes - 1;
 
   /* Allocate stack for back-tracking up CFG.  */
   stack = XNEWVEC (edge_iterator, loop->num_nodes + 1);
@@ -1408,13 +1412,13 @@ loop_inverted_post_order_compute (int *post_order, 
class loop *loop)
   time, check its predecessors.  */
stack[sp++] = ei_start (pred->preds);
  else
-   post_order[post_order_num++] = pred->index;
+   post_order[post_order_num--] = pred->index;
}
   else
{
  if (flow_bb_inside_loop_p (loop, bb)
  && ei_one_before_end_p (ei))

[PATCH 2/3] change inverted_post_order_compute to inverted_rev_post_order_compute

2023-04-21 Thread Richard Biener via Gcc-patches
The following changes the inverted_post_order_compute API back to
a plain C array interface and makes it compute a reverse post order,
since that's what's always required.  It will make massaging DF to use
the correct iteration orders easier.  Elsewhere it requires replacing
backward iteration over the computed order with forward iteration.
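
A minimal usage sketch of the new interface (my illustration, not part
of the patch; process_block is a stand-in for whatever a caller does
per basic block):

  int *rpo = XNEWVEC (int, n_basic_blocks_for_fn (cfun));
  int n = inverted_rev_post_order_compute (cfun, rpo);
  for (int i = 0; i < n; ++i)
    /* Forward iteration now yields reverse post order directly.  */
    process_block (BASIC_BLOCK_FOR_FN (cfun, rpo[i]));
  free (rpo);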

Bootstrapped and tested on x86_64-unknown-linux-gnu.

* cfganal.h (inverted_rev_post_order_compute): Rename
from ...
(inverted_post_order_compute): ... this.  Add struct function
argument, change allocation to a C array.
* cfganal.cc (inverted_rev_post_order_compute): Likewise.
* lcm.cc (compute_antinout_edge): Adjust.
* lra-lives.cc (lra_create_live_ranges_1): Likewise.
* tree-ssa-dce.cc (remove_dead_stmt): Likewise.
* tree-ssa-pre.cc (compute_antic): Likewise.
---
 gcc/cfganal.cc  | 41 ++---
 gcc/cfganal.h   |  3 ++-
 gcc/lcm.cc  |  9 +
 gcc/lra-lives.cc| 11 ++-
 gcc/tree-ssa-dce.cc | 15 ---
 gcc/tree-ssa-pre.cc | 18 ++
 6 files changed, 53 insertions(+), 44 deletions(-)

diff --git a/gcc/cfganal.cc b/gcc/cfganal.cc
index ef24c5e4d15..cc858b99e64 100644
--- a/gcc/cfganal.cc
+++ b/gcc/cfganal.cc
@@ -740,7 +740,7 @@ post_order_compute (int *post_order, bool 
include_entry_exit,
 }
 
 
-/* Helper routine for inverted_post_order_compute
+/* Helper routine for inverted_rev_post_order_compute
flow_dfs_compute_reverse_execute, and the reverse-CFG
deapth first search in dominance.cc.
BB has to belong to a region of CFG
@@ -820,12 +820,14 @@ dfs_find_deadend (basic_block bb)
and start looking for a "dead end" from that block
and do another inverted traversal from that block.  */
 
-void
-inverted_post_order_compute (vec<int> *post_order,
-sbitmap *start_points)
+int
+inverted_rev_post_order_compute (struct function *fn,
+int *rev_post_order,
+sbitmap *start_points)
 {
   basic_block bb;
-  post_order->reserve_exact (n_basic_blocks_for_fn (cfun));
+
+  int rev_post_order_num = n_basic_blocks_for_fn (fn) - 1;
 
   if (flag_checking)
 verify_no_unreachable_blocks ();
@@ -855,17 +857,17 @@ inverted_post_order_compute (vec<int> *post_order,
}
 }
   else
-  /* Put all blocks that have no successor into the initial work list.  */
-  FOR_ALL_BB_FN (bb, cfun)
-if (EDGE_COUNT (bb->succs) == 0)
-  {
-/* Push the initial edge on to the stack.  */
-if (EDGE_COUNT (bb->preds) > 0)
-  {
-   stack.quick_push (ei_start (bb->preds));
-bitmap_set_bit (visited, bb->index);
-  }
-  }
+/* Put all blocks that have no successor into the initial work list.  */
+FOR_ALL_BB_FN (bb, cfun)
+  if (EDGE_COUNT (bb->succs) == 0)
+   {
+ /* Push the initial edge on to the stack.  */
+ if (EDGE_COUNT (bb->preds) > 0)
+   {
+ stack.quick_push (ei_start (bb->preds));
+ bitmap_set_bit (visited, bb->index);
+   }
+   }
 
   do
 {
@@ -893,13 +895,13 @@ inverted_post_order_compute (vec<int> *post_order,
time, check its predecessors.  */
stack.quick_push (ei_start (pred->preds));
   else
-   post_order->quick_push (pred->index);
+   rev_post_order[rev_post_order_num--] = pred->index;
 }
   else
 {
  if (bb != EXIT_BLOCK_PTR_FOR_FN (cfun)
  && ei_one_before_end_p (ei))
-   post_order->quick_push (bb->index);
+   rev_post_order[rev_post_order_num--] = bb->index;
 
   if (!ei_one_before_end_p (ei))
    ei_next (&stack.last ());
@@ -957,7 +959,8 @@ inverted_post_order_compute (vec<int> *post_order,
   while (!stack.is_empty ());
 
   /* EXIT_BLOCK is always included.  */
-  post_order->quick_push (EXIT_BLOCK);
+  rev_post_order[rev_post_order_num--] = EXIT_BLOCK;
+  return n_basic_blocks_for_fn (fn);
 }
 
 /* Compute the depth first search order of FN and store in the array
diff --git a/gcc/cfganal.h b/gcc/cfganal.h
index 0b6c67dfab5..5af917c27dd 100644
--- a/gcc/cfganal.h
+++ b/gcc/cfganal.h
@@ -66,7 +66,8 @@ extern void add_noreturn_fake_exit_edges (void);
 extern void connect_infinite_loops_to_exit (void);
 extern int post_order_compute (int *, bool, bool);
 extern basic_block dfs_find_deadend (basic_block);
-extern void inverted_post_order_compute (vec<int> *postorder, sbitmap *start_points = 0);
+extern int inverted_rev_post_order_compute (struct function *,
+   int *, sbitmap *start_points = 0);
 extern int pre_and_rev_post_order_compute_fn (struct function *,
  int *, int *, bool);
 extern int pre_and_rev_post_order_compute (int *, int *, bool);
diff --git 

[PATCH 1/3] change DF to use the proper CFG order for DF_FORWARD problems

2023-04-21 Thread Richard Biener via Gcc-patches
This changes DF to use RPO on the forward graph for DF_FORWARD
problems.  While that naturally maps to pre_and_rev_post_order_compute,
we use the existing (wrong) CFG order for DF_BACKWARD problems
computed by post_order_compute since that provides the required
side-effect of deleting unreachable blocks.

The change requires turning the inconsistent vec<int> vs. int * back
to a consistent int *.  A followup patch will change the
inverted_post_order_compute API and change the DF_BACKWARD problem
to use the correct RPO on the backward graph together with statistics
I produced last year for the combined effect.

Bootstrapped and tested on x86_64-unknown-linux-gnu.

* df.h (df_d::postorder_inverted): Change back to int *,
clarify comments.
* df-core.cc (rest_of_handle_df_finish): Adjust.
(df_analyze_1): Likewise.
(df_analyze): For DF_FORWARD problems use RPO on the forward
graph.  Adjust.
(loop_inverted_post_order_compute): Adjust API.
(df_analyze_loop): Adjust.
(df_get_n_blocks): Likewise.
(df_get_postorder): Likewise.
---
 gcc/df-core.cc | 58 ++
 gcc/df.h   |  8 +++
 2 files changed, 34 insertions(+), 32 deletions(-)

diff --git a/gcc/df-core.cc b/gcc/df-core.cc
index de5cbd0c622..27123645da5 100644
--- a/gcc/df-core.cc
+++ b/gcc/df-core.cc
@@ -810,7 +810,7 @@ rest_of_handle_df_finish (void)
 }
 
   free (df->postorder);
-  df->postorder_inverted.release ();
+  free (df->postorder_inverted);
   free (df->hard_regs_live_count);
   free (df);
   df = NULL;
@@ -1207,9 +1207,6 @@ df_analyze_1 (void)
 {
   int i;
 
-  /* These should be the same.  */
-  gcc_assert ((unsigned) df->n_blocks == df->postorder_inverted.length ());
-
   /* We need to do this before the df_verify_all because this is
  not kept incrementally up to date.  */
   df_compute_regs_ever_live (false);
@@ -1232,8 +1229,8 @@ df_analyze_1 (void)
   if (dflow->problem->dir == DF_FORWARD)
 df_analyze_problem (dflow,
 df->blocks_to_analyze,
-   df->postorder_inverted.address (),
-   df->postorder_inverted.length ());
+   df->postorder_inverted,
+   df->n_blocks);
   else
 df_analyze_problem (dflow,
 df->blocks_to_analyze,
@@ -1261,10 +1258,15 @@ df_analyze (void)
   bitmap current_all_blocks = BITMAP_ALLOC (&df_bitmap_obstack);
 
   free (df->postorder);
+  free (df->postorder_inverted);
   df->postorder = XNEWVEC (int, last_basic_block_for_fn (cfun));
   df->n_blocks = post_order_compute (df->postorder, true, true);
-  df->postorder_inverted.truncate (0);
-  inverted_post_order_compute (&df->postorder_inverted);
+  /* For DF_FORWARD use a RPO on the forward graph.  Since we want to
+ have unreachable blocks deleted use post_order_compute and reverse
+ the order.  */
+  df->postorder_inverted = XNEWVEC (int, n_basic_blocks_for_fn (cfun));
+  for (int i = 0; i < df->n_blocks; ++i)
+df->postorder_inverted[i] = df->postorder[df->n_blocks - 1 - i];
 
   for (int i = 0; i < df->n_blocks; i++)
 bitmap_set_bit (current_all_blocks, df->postorder[i]);
@@ -1273,7 +1275,7 @@ df_analyze (void)
 {
   /* Verify that POSTORDER_INVERTED only contains blocks reachable from
 the ENTRY block.  */
-  for (unsigned int i = 0; i < df->postorder_inverted.length (); i++)
+  for (int i = 0; i < df->n_blocks; i++)
gcc_assert (bitmap_bit_p (current_all_blocks,
  df->postorder_inverted[i]));
 }
@@ -1283,12 +1285,11 @@ df_analyze (void)
   if (df->analyze_subset)
 {
   bitmap_and_into (df->blocks_to_analyze, current_all_blocks);
-  df->n_blocks = df_prune_to_subcfg (df->postorder,
-df->n_blocks, df->blocks_to_analyze);
-  unsigned int newlen = df_prune_to_subcfg (df->postorder_inverted.address (),
-                                            df->postorder_inverted.length (),
-                                            df->blocks_to_analyze);
-  df->postorder_inverted.truncate (newlen);
+  unsigned int newlen = df_prune_to_subcfg (df->postorder, df->n_blocks,
+   df->blocks_to_analyze);
+  df_prune_to_subcfg (df->postorder_inverted, df->n_blocks,
+ df->blocks_to_analyze);
+  df->n_blocks = newlen;
   BITMAP_FREE (current_all_blocks);
 }
   else
@@ -1364,14 +1365,13 @@ loop_post_order_compute (int *post_order, class loop 
*loop)
 /* Compute the reverse top sort order of the inverted sub-CFG specified
by LOOP.  Returns the number of blocks which is always loop->num_nodes.  */
 
-static void
-loop_inverted_post_order_compute (vec<int> *post_order, class loop *loop)
+static int

Re: [PATCH] Implement range-op entry for sin/cos.

2023-04-21 Thread Siddhesh Poyarekar

On 2023-04-21 02:52, Jakub Jelinek wrote:

On Thu, Apr 20, 2023 at 09:14:10PM -0400, Siddhesh Poyarekar wrote:

On 2023-04-20 13:57, Siddhesh Poyarekar wrote:

For bounds that aren't representable, one could get error bounds from
libm-test-ulps data in glibc, although I reckon those won't be
exhaustive.  From a quick peek at the sin/cos data, the arc target seems
to be among the worst performers at about 7ulps, although if you include
the complex routines we get close to 13 ulps.  The very worst
imprecision among all math routines (that's gamma) is at 16 ulps for
power in glibc tests, so maybe allowing about 25-30 ulps error in bounds
might work across the board.


I was thinking about this a bit more, and it seems like we should limit
ranges to targets that can generate sane results (i.e. error bounds
within, say, 5-6 ulps) and, for the rest, avoid emitting the ranges
altogether.  Emitting a bad range for all architectures seems like a net
worse solution again.


Well, at least for basic arithmetics when libm functions aren't involved,
there is no point in disabling ranges altogether.


Oh yeah, I did mean only franges for math function call results.


And, for libm functions, my plan was to introduce a target hook, which
would take a combined_fn argument to tell which function is queried,
a machine_mode to say which floating point format, and perhaps a bool
saying whether the query is for ulps at these basic math boundaries or
for results somewhere in between, and would return the error in ulps as
an unsigned int, 0 meaning 0.5ulp precision.
So, for CASE_CFN_SIN: CASE_CFN_COS: the glibc handler could say that
ulps is, say, 3 inside of the ranges and 0 on the boundaries with
!flag_rounding_math, and 6 and 2 with flag_rounding_math, or whatever.
And the generic code wouldn't assume anything if ulps is, say, 100 or more.
The hook would need to return a union of the precision of all supported
versions of the library through history, including say libmvec, because
function calls could be vectorized.
And the default could be infinite precision.
Back in November I posted a proglet that can generate some ulps from
random number testing, plus on glibc we could pick maximums from the
ulps files.  And if needed, say powerpc*-linux could override the generic
glibc version for some subset of functions and call the default otherwise
(say at least for __ibm128).
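
To make that concrete, a minimal sketch of what such a hook body might
look like (entirely hypothetical: the hook name and signature don't
exist yet, and the ulps values are the placeholder numbers from above):

  /* Return the maximum known error of FN on MODE in ulps; 0 means
     correctly rounded (0.5ulp), ~0U means unknown/unbounded.  */
  static unsigned int
  glibc_libm_function_max_error (combined_fn fn, machine_mode /*mode*/,
                                 bool boundary_p)
  {
    switch (fn)
      {
      CASE_CFN_SIN:
      CASE_CFN_COS:
        if (flag_rounding_math)
          return boundary_p ? 2 : 6;
        return boundary_p ? 0 : 3;
      default:
        return ~0U;  /* Generic code should then assume nothing.  */
      }
  }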


Ack, that sounds like a plan.

Thanks,
Sid


RE: Re: [PATCH] RISC-V: Allow VMS{Compare} (V1, V1) shortcut optimization

2023-04-21 Thread Li, Pan2 via Gcc-patches
Thanks kito, will try to reproduce this issue and keep you posted.

Pan

-Original Message-
From: Kito Cheng  
Sent: Friday, April 21, 2023 6:17 PM
To: Li, Pan2 
Cc: juzhe.zh...@rivai.ai; gcc-patches ; Kito.cheng 
; Wang, Yanzhang 
Subject: Re: Re: [PATCH] RISC-V: Allow VMS{Compare} (V1, V1) shortcut 
optimization

I got a bunch of new fails including ICE for gcc testsuite, and some cases are 
hanging there, could you take a look?

$ riscv64-unknown-linux-gnu-gcc
gcc.target/riscv/rvv/vsetvl/avl_single-92.c -O2 -march=rv32gcv
-mabi=ilp32
during RTL pass: expand
/scratch1/kitoc/riscv-gnu-workspace/riscv-gnu-toolchain-trunk/gcc/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/avl_single-92.c:
In function 'f':
/scratch1/kitoc/riscv-gnu-workspace/riscv-gnu-toolchain-trunk/gcc/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/avl_single-92.c:8:13:
internal compiler error: in maybe_gen_insn, at optabs.cc:8102
8 |   vbool64_t mask = *(vbool64_t*) (in + 100);
  | ^~~~
0x130d278 maybe_gen_insn(insn_code, unsigned int, expand_operand*)
../../../../riscv-gnu-toolchain-trunk/gcc/gcc/optabs.cc:8102


On Fri, Apr 21, 2023 at 5:47 PM Li, Pan2 via Gcc-patches 
 wrote:
>
> Kindly ping for the PATCH v2. Just FYI there will be some underlying 
> investigation based on this PATCH like VMSEQ.
>
> Pan
>
> -Original Message-
> From: Li, Pan2
> Sent: Wednesday, April 19, 2023 7:27 PM
> To: 'Kito Cheng' ; 'juzhe.zh...@rivai.ai' 
> 
> Cc: 'gcc-patches' ; 'Kito.cheng' 
> ; Wang, Yanzhang 
> Subject: RE: Re: [PATCH] RISC-V: Allow VMS{Compare} (V1, V1) shortcut 
> optimization
>
> Update the Patch v2 for more detail information for clarification. Please 
> help to review continuously.
>
> https://gcc.gnu.org/pipermail/gcc-patches/2023-April/616175.html
>
> Pan
>
> -Original Message-
> From: Li, Pan2
> Sent: Wednesday, April 19, 2023 6:33 PM
> To: Kito Cheng ; juzhe.zh...@rivai.ai
> Cc: gcc-patches ; Kito.cheng 
> ; Wang, Yanzhang 
> Subject: RE: Re: [PATCH] RISC-V: Allow VMS{Compare} (V1, V1) shortcut 
> optimization
>
> Sure thing.
>
> For the ChangeLog, I assumed it was generated automatically before. LOL.
>
> Pan
>
> -Original Message-
> From: Kito Cheng 
> Sent: Wednesday, April 19, 2023 5:46 PM
> To: juzhe.zh...@rivai.ai
> Cc: Li, Pan2 ; gcc-patches 
> ; Kito.cheng ; Wang, 
> Yanzhang 
> Subject: Re: Re: [PATCH] RISC-V: Allow VMS{Compare} (V1, V1) shortcut 
> optimization
>
> Hi JuZhe:
>
> Thanks for explaining!
>
>
> Hi Pan:
>
> I think it would be helpful if JuZhe's explanation could be written 
> into the commit log.
>
>
> > gcc/ChangeLog:
> >
> >* config/riscv/riscv-v.cc (emit_pred_op):
> >* config/riscv/riscv-vector-builtins-bases.cc:
> >* config/riscv/vector.md:
>
> And don't forget to write something in the ChangeLog... :P


Re: [RFC 0/X] Implement GCC support for AArch64 libmvec

2023-04-21 Thread Jakub Jelinek via Gcc-patches
On Fri, Apr 21, 2023 at 10:54:51AM +0100, Richard Sandiford wrote:
> > I'm guessing the keyword here is 'trait' which I'm guessing is different 
> > from a omp declare simd directive, which is why it's not required to 
> > have a simdlen clause in an omp declare simd (see Jakub's comment).
> 
> Sure.  The thread above is about whether we need extension("scalable")
> or should drop it.  And extension("scalable") is only used in omp
> declare variant.  This was in response to "I also do not see a need
> for the 'omp declare variant' scalable extension constructs".

I'm not sure extension("scalable") in context selectors is what you want
for handling declare variant.  While the extension trait is allowed and
it is implementation defined what is accepted as its arguments (within
the boundaries of the allowed syntax), in this case you really want to
adjust the behavior of the simd trait, so it would be better to specify
that you want a scalable simdlen using a simd trait property.

There will be an OpenMP F2F meeting next month; I think this should be
discussed there and agreement reached on how to do this.  After all, it
seems Arm won't be the only architecture that needs it; RISC-V might be
another.

Jakub



Re: Re: [PATCH] RISC-V: Allow VMS{Compare} (V1, V1) shortcut optimization

2023-04-21 Thread Kito Cheng via Gcc-patches
I got a bunch of new fails including ICE for gcc testsuite, and some
cases are hanging there, could you take a look?

$ riscv64-unknown-linux-gnu-gcc
gcc.target/riscv/rvv/vsetvl/avl_single-92.c -O2 -march=rv32gcv
-mabi=ilp32
during RTL pass: expand
/scratch1/kitoc/riscv-gnu-workspace/riscv-gnu-toolchain-trunk/gcc/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/avl_single-92.c:
In function 'f':
/scratch1/kitoc/riscv-gnu-workspace/riscv-gnu-toolchain-trunk/gcc/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/avl_single-92.c:8:13:
internal compiler error: in maybe_gen_insn, at optabs.cc:8102
8 |   vbool64_t mask = *(vbool64_t*) (in + 100);
  | ^~~~
0x130d278 maybe_gen_insn(insn_code, unsigned int, expand_operand*)
../../../../riscv-gnu-toolchain-trunk/gcc/gcc/optabs.cc:8102


On Fri, Apr 21, 2023 at 5:47 PM Li, Pan2 via Gcc-patches
 wrote:
>
> Kindly ping for the PATCH v2. Just FYI there will be some underlying 
> investigation based on this PATCH like VMSEQ.
>
> Pan
>
> -Original Message-
> From: Li, Pan2
> Sent: Wednesday, April 19, 2023 7:27 PM
> To: 'Kito Cheng' ; 'juzhe.zh...@rivai.ai' 
> 
> Cc: 'gcc-patches' ; 'Kito.cheng' 
> ; Wang, Yanzhang 
> Subject: RE: Re: [PATCH] RISC-V: Allow VMS{Compare} (V1, V1) shortcut 
> optimization
>
> Update the Patch v2 for more detail information for clarification. Please 
> help to review continuously.
>
> https://gcc.gnu.org/pipermail/gcc-patches/2023-April/616175.html
>
> Pan
>
> -Original Message-
> From: Li, Pan2
> Sent: Wednesday, April 19, 2023 6:33 PM
> To: Kito Cheng ; juzhe.zh...@rivai.ai
> Cc: gcc-patches ; Kito.cheng 
> ; Wang, Yanzhang 
> Subject: RE: Re: [PATCH] RISC-V: Allow VMS{Compare} (V1, V1) shortcut 
> optimization
>
> Sure thing.
>
> For the ChangeLog, I assumed it was generated automatically before. LOL.
>
> Pan
>
> -Original Message-
> From: Kito Cheng 
> Sent: Wednesday, April 19, 2023 5:46 PM
> To: juzhe.zh...@rivai.ai
> Cc: Li, Pan2 ; gcc-patches ; 
> Kito.cheng ; Wang, Yanzhang 
> Subject: Re: Re: [PATCH] RISC-V: Allow VMS{Compare} (V1, V1) shortcut 
> optimization
>
> Hi JuZhe:
>
> Thanks for explaining!
>
>
> Hi Pan:
>
> I think it would be helpful if JuZhe's explanation could be written 
> into the commit log.
>
>
> > gcc/ChangeLog:
> >
> >* config/riscv/riscv-v.cc (emit_pred_op):
> >* config/riscv/riscv-vector-builtins-bases.cc:
> >* config/riscv/vector.md:
>
> And don't forget to write something in the ChangeLog... :P


[PATCH] RISC-V: decouple stack allocation for rv32e w/o save-restore.

2023-04-21 Thread Fei Gao
Currently on rv32e, stack allocation for GPR callee-saved registers is
always 12 bytes without save-restore.  Actually, for the case without
save-restore, less stack memory can be reserved.  This patch decouples
the stack allocation for rv32e without save-restore and makes
riscv_compute_frame_info more readable.

output of testcase rv32e_stack.c
before patch:
addi  sp,sp,-16
sw  ra,12(sp)
call  getInt
sw  a0,0(sp)
lw  a0,0(sp)
call  PrintInts
lw  a5,0(sp)
mv  a0,a5
lw  ra,12(sp)
addi  sp,sp,16
jr  ra

after patch:
addi  sp,sp,-8
sw  ra,4(sp)
call  getInt
sw  a0,0(sp)
lw  a0,0(sp)
call  PrintInts
lw  a5,0(sp)
mv  a0,a5
lw  ra,4(sp)
addi  sp,sp,8
jr  ra
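
(A hand-worked reading of the example above, based on the comment in the
diff below: the save/restore libcalls handle GPRs in groups of three, so
the old code always reserved riscv_stack_align (3 * UNITS_PER_WORD) = 12
bytes for the GPR area even though only ra is saved here; with the patch,
the non-save-restore path reserves riscv_stack_align (1 * UNITS_PER_WORD)
= 4 bytes, shrinking the frame from 16 to 8 bytes.)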

gcc/ChangeLog:

* config/riscv/riscv.cc (riscv_forbid_save_libcall): New helper
function for riscv_use_save_libcall.
(riscv_use_save_libcall): Call riscv_forbid_save_libcall.
(riscv_compute_frame_info): Restructure to decouple stack allocation
for rv32e w/o save-restore.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rv32e_stack.c: New test.
---
 gcc/config/riscv/riscv.cc| 57 
 gcc/testsuite/gcc.target/riscv/rv32e_stack.c | 14 +
 2 files changed, 49 insertions(+), 22 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rv32e_stack.c

diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index 5d2550871c7..6ccdfe96fe7 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -4772,12 +4772,26 @@ riscv_save_reg_p (unsigned int regno)
   return false;
 }
 
+/* Determine whether to disable GPR save/restore routines.  */
+static bool
+riscv_forbid_save_libcall (void)
+{
+  if (!TARGET_SAVE_RESTORE
+  || crtl->calls_eh_return
+  || frame_pointer_needed
+  || cfun->machine->interrupt_handler_p
+  || cfun->machine->varargs_size != 0
+  || crtl->args.pretend_args_size != 0)
+return true;
+
+  return false;
+}
+
 /* Determine whether to call GPR save/restore routines.  */
 static bool
 riscv_use_save_libcall (const struct riscv_frame_info *frame)
 {
-  if (!TARGET_SAVE_RESTORE || crtl->calls_eh_return || frame_pointer_needed
-  || cfun->machine->interrupt_handler_p)
+  if (riscv_forbid_save_libcall ())
 return false;
 
   return frame->save_libcall_adjustment != 0;
@@ -4857,7 +4871,7 @@ riscv_compute_frame_info (void)
   struct riscv_frame_info *frame;
   poly_int64 offset;
   bool interrupt_save_prologue_temp = false;
-  unsigned int regno, i, num_x_saved = 0, num_f_saved = 0;
+  unsigned int regno, i, num_x_saved = 0, num_f_saved = 0, x_save_size = 0;
 
  frame = &cfun->machine->frame;
 
@@ -4895,24 +4909,14 @@ riscv_compute_frame_info (void)
frame->fmask |= 1 << (regno - FP_REG_FIRST), num_f_saved++;
 }
 
-  /* At the bottom of the frame are any outgoing stack arguments. */
-  offset = riscv_stack_align (crtl->outgoing_args_size);
-  /* Next are local stack variables. */
-  offset += riscv_stack_align (get_frame_size ());
-  /* The virtual frame pointer points above the local variables. */
-  frame->frame_pointer_offset = offset;
-  /* Next are the callee-saved FPRs. */
-  if (frame->fmask)
-offset += riscv_stack_align (num_f_saved * UNITS_PER_FP_REG);
-  frame->fp_sp_offset = offset - UNITS_PER_FP_REG;
-  /* Next are the callee-saved GPRs. */
   if (frame->mask)
 {
-  unsigned x_save_size = riscv_stack_align (num_x_saved * UNITS_PER_WORD);
+  x_save_size = riscv_stack_align (num_x_saved * UNITS_PER_WORD);
   unsigned num_save_restore = 1 + riscv_save_libcall_count (frame->mask);
 
   /* Only use save/restore routines if they don't alter the stack size.  */
-  if (riscv_stack_align (num_save_restore * UNITS_PER_WORD) == x_save_size)
+  if (riscv_stack_align (num_save_restore * UNITS_PER_WORD) == x_save_size
+  && !riscv_forbid_save_libcall ())
{
  /* Libcall saves/restores 3 registers at once, so we need to
 allocate 12 bytes for callee-saved register.  */
@@ -4921,9 +4925,21 @@ riscv_compute_frame_info (void)
 
  frame->save_libcall_adjustment = x_save_size;
}
-
-  offset += x_save_size;
 }
+
+  /* At the bottom of the frame are any outgoing stack arguments. */
+  offset = riscv_stack_align (crtl->outgoing_args_size);
+  /* Next are local stack variables. */
+  offset += riscv_stack_align (get_frame_size ());
+  /* The virtual frame pointer points above the local variables. */
+  frame->frame_pointer_offset = offset;
+  /* Next are the callee-saved FPRs. */
+  if (frame->fmask)
+offset += riscv_stack_align (num_f_saved * UNITS_PER_FP_REG);
+  frame->fp_sp_offset = offset - UNITS_PER_FP_REG;
+  /* Next are the callee-saved GPRs. */
+  if (frame->mask)
+offset += x_save_size;
   

Re: [RFC 0/X] Implement GCC support for AArch64 libmvec

2023-04-21 Thread Richard Sandiford via Gcc-patches
"Andre Vieira (lists)"  writes:
> On 20/04/2023 17:13, Richard Sandiford wrote:
>> "Andre Vieira (lists)"  writes:
>>> On 20/04/2023 15:51, Richard Sandiford wrote:
 "Andre Vieira (lists)"  writes:
> Hi all,
>
> This is a series of patches/RFCs to implement support in GCC to be able
> to target AArch64's libmvec functions that will be/are being added to 
> glibc.
> We have chosen to use the omp pragma '#pragma omp declare variant ...'
> with a simd construct as the way for glibc to inform GCC what functions
> are available.
>
> For example, if we would like to supply a vector version of the scalar
> 'cosf' we would have an include file with something like:
> typedef __attribute__((__neon_vector_type__(4))) float __f32x4_t;
> typedef __attribute__((__neon_vector_type__(2))) float __f32x2_t;
> typedef __SVFloat32_t __sv_f32_t;
> typedef __SVBool_t __sv_bool_t;
> __f32x4_t _ZGVnN4v_cosf (__f32x4_t);
> __f32x2_t _ZGVnN2v_cosf (__f32x2_t);
> __sv_f32_t _ZGVsMxv_cosf (__sv_f32_t, __sv_bool_t);
> #pragma omp declare variant(_ZGVnN4v_cosf) \
>match(construct = {simd(notinbranch, simdlen(4))}, device =
> {isa("simd")})
> #pragma omp declare variant(_ZGVnN2v_cosf) \
>match(construct = {simd(notinbranch, simdlen(2))}, device =
> {isa("simd")})
> #pragma omp declare variant(_ZGVsMxv_cosf) \
>match(construct = {simd(inbranch)}, device = {isa("sve")})
> extern float cosf (float);
>
> The BETA ABI can be found in the vfabia64 subdir of
> https://github.com/ARM-software/abi-aa/
> This currently disagrees with how this patch series implements 'omp
> declare simd' for SVE and I also do not see a need for the 'omp declare
> variant' scalable extension constructs. I will make changes to the ABI
> once we've finalized the co-design of the ABI and this implementation.

 I don't see a good reason for dropping the extension("scalable").
 The problem is that since the base spec requires a simdlen clause,
 GCC should in general raise an error if simdlen is omitted.
>>> Where can you find this in the specs? I tried to find it but couldn't.
>>>
>>> Leaving out simdlen in a 'omp declare simd' I assume is OK, our vector
>>> ABI defines behaviour for this. But I couldn't find what it meant for a
>>> omp declare variant, obviously can't be the same as for declare simd, as
>>> that is defined to mean 'define a set of clones' and only one clone can
>>> be associated to a declare variant.
>> 
>> I was going from https://www.openmp.org/spec-html/5.0/openmpsu25.html ,
>> which says:
>> 
>>The simd trait can be further defined with properties that match the
>>clauses accepted by the declare simd directive with the same name and
>>semantics. The simd trait must define at least the simdlen property and
>>one of the inbranch or notinbranch properties.
>> 
>> (probably best to read it in the original -- it's almost incomprehensible
>> without markup)
>> 
> I'm guessing the keyword here is 'trait', which I'm guessing is different 
> from an omp declare simd directive, which is why it's not required to 
> have a simdlen clause in an omp declare simd (see Jakub's comment).

Sure.  The thread above is about whether we need extension("scalable")
or should drop it.  And extension("scalable") is only used in omp
declare variant.  This was in response to "I also do not see a need
for the 'omp declare variant' scalable extension constructs".

Not having a simdlen on an omp declare simd is of course OK (and the
VFABI defines behaviour for that case).

Richard


Re: [PATCH V4] RISC-V: Defer vsetvli insertion to later if possible [PR108270]

2023-04-21 Thread Kito Cheng via Gcc-patches
Thanks, committed to trunk.

On Fri, Apr 21, 2023 at 5:19 PM  wrote:
>
> From: Juzhe-Zhong 
>
> Fix issue: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108270.
>
> Consider the following testcase:
> void f (void * restrict in, void * restrict out, int l, int n, int m)
> {
>   for (int i = 0; i < l; i++){
> for (int j = 0; j < m; j++){
>   for (int k = 0; k < n; k++)
> {
>   vint8mf8_t v = __riscv_vle8_v_i8mf8 (in + i + j, 17);
>   __riscv_vse8_v_i8mf8 (out + i + j, v, 17);
> }
> }
>   }
> }
>
> Compile option: -O3
>
> Before this patch:
> mv  a7,a2
> mv  a6,a0
> mv  t1,a1
> mv  a2,a3
> vsetivli  zero,17,e8,mf8,ta,ma
> ble a7,zero,.L1
> ble a4,zero,.L1
> ble a3,zero,.L1
> ...
>
> After this patch:
> mv  a7,a2
> mv  a6,a0
> mv  t1,a1
> mv  a2,a3
> ble a7,zero,.L1
> ble a4,zero,.L1
> ble a3,zero,.L1
> add a1,a0,a4
> li  a0,0
> vsetivli  zero,17,e8,mf8,ta,ma
> ...
>
> This issue is a missed optimization produced by Phase 3 (global backward
> demand fusion) rather than by LCM.
>
> This patch fixes poor placement of the vsetvl.
>
> This point is selected not by LCM but by Phase 3 (VL/VTYPE demand info
> backward fusion and propagation), which I introduced into the VSETVL pass
> to enhance LCM and improve vsetvl instruction performance.
>
> This patch suppresses Phase 3's too-aggressive backward fusion and
> propagation to the top of the function when there is no defining
> instruction of the AVL (the AVL is an immediate 0 ~ 31, since the
> vsetivli instruction allows an immediate value instead of a register).
>
> You may wonder why we need Phase 3 to do the job.
> Well, there are many situations that pure LCM fails to optimize; here is
> a simple case to demonstrate it:
>
> void f (void * restrict in, void * restrict out, int n, int m, int cond)
> {
>   size_t vl = 101;
>   for (size_t j = 0; j < m; j++){
> if (cond) {
>   for (size_t i = 0; i < n; i++)
> {
>   vint8mf8_t v = __riscv_vle8_v_i8mf8 (in + i + j, vl);
>   __riscv_vse8_v_i8mf8 (out + i, v, vl);
> }
> } else {
>   for (size_t i = 0; i < n; i++)
> {
>   vint32mf2_t v = __riscv_vle32_v_i32mf2 (in + i + j, vl);
>   v = __riscv_vadd_vv_i32mf2 (v,v,vl);
>   __riscv_vse32_v_i32mf2 (out + i, v, vl);
> }
> }
>   }
> }
>
> You can see:
> The first inner loop needs vsetvli e8 mf8 for vle+vse.
> The second inner loop needs vsetvli e32 mf2 for vle+vadd+vse.
>
> If we don't have Phase 3 (Only handled by LCM (Phase 4)), we will end up with 
> :
>
> outerloop:
> ...
> vsetvli e8mf8
> inner loop 1:
> 
>
> vsetvli e32mf2
> inner loop 2:
> 
>
> However, if we have Phase 3, Phase 3 is going to fuse the vsetvli e32 mf2 of 
> inner loop 2 into vsetvli e8 mf8, then we will end up with this result after 
> phase 3:
>
> outerloop:
> ...
> inner loop 1:
> vsetvli e32mf2
> 
>
> inner loop 2:
> vsetvli e32mf2
> 
>
> Then, this demand information after phase 3 will be well optimized after 
> phase 4 (LCM), after Phase 4 result is:
>
> vsetvli e32mf2
> outerloop:
> ...
> inner loop 1:
> 
>
> inner loop 2:
> 
>
> You can see this is the optimal codegen after the current VSETVL pass
> (Phase 3: demand backward fusion and propagation + Phase 4: LCM).  This
> has been a known issue since I started to implement the VSETVL pass.
>
> PR 108270
>
> gcc/ChangeLog:
>
> * config/riscv/riscv-vsetvl.cc 
> (vector_infos_manager::all_empty_predecessor_p): New function.
> (pass_vsetvl::backward_demand_fusion): Ditto.
> * config/riscv/riscv-vsetvl.h: Ditto.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/riscv/rvv/vsetvl/imm_bb_prop-1.c: Adapt testcase.
> * gcc.target/riscv/rvv/vsetvl/imm_conflict-3.c: Ditto.
> * gcc.target/riscv/rvv/vsetvl/pr108270.c: New test.
>
> ---
>  gcc/config/riscv/riscv-vsetvl.cc  | 23 +++
>  gcc/config/riscv/riscv-vsetvl.h   |  2 ++
>  .../riscv/rvv/vsetvl/imm_bb_prop-1.c  |  2 +-
>  .../riscv/rvv/vsetvl/imm_conflict-3.c |  4 ++--
>  .../gcc.target/riscv/rvv/vsetvl/pr108270.c| 19 +++
>  5 files changed, 47 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr108270.c
>
> diff --git a/gcc/config/riscv/riscv-vsetvl.cc 
> b/gcc/config/riscv/riscv-vsetvl.cc
> index 5f424221659..167e3c6145c 100644
> --- a/gcc/config/riscv/riscv-vsetvl.cc
> +++ b/gcc/config/riscv/riscv-vsetvl.cc
> @@ -2355,6 +2355,21 @@ vector_infos_manager::get_all_available_exprs (
>return available_list;
>  }
>
> +bool
> +vector_infos_manager::all_empty_predecessor_p (const basic_block cfg_bb) 
> const
> +{
> +  hash_set<basic_block> pred_cfg_bbs = get_all_predecessors (cfg_bb);
> +  for 

Re: [RFC 0/X] Implement GCC support for AArch64 libmvec

2023-04-21 Thread Andre Vieira (lists) via Gcc-patches




On 20/04/2023 17:13, Richard Sandiford wrote:

"Andre Vieira (lists)"  writes:

On 20/04/2023 15:51, Richard Sandiford wrote:

"Andre Vieira (lists)"  writes:

Hi all,

This is a series of patches/RFCs to implement support in GCC to be able
to target AArch64's libmvec functions that will be/are being added to glibc.
We have chosen to use the omp pragma '#pragma omp declare variant ...'
with a simd construct as the way for glibc to inform GCC what functions
are available.

For example, if we would like to supply a vector version of the scalar
'cosf' we would have an include file with something like:
typedef __attribute__((__neon_vector_type__(4))) float __f32x4_t;
typedef __attribute__((__neon_vector_type__(2))) float __f32x2_t;
typedef __SVFloat32_t __sv_f32_t;
typedef __SVBool_t __sv_bool_t;
__f32x4_t _ZGVnN4v_cosf (__f32x4_t);
__f32x2_t _ZGVnN2v_cosf (__f32x2_t);
__sv_f32_t _ZGVsMxv_cosf (__sv_f32_t, __sv_bool_t);
#pragma omp declare variant(_ZGVnN4v_cosf) \
   match(construct = {simd(notinbranch, simdlen(4))}, device =
{isa("simd")})
#pragma omp declare variant(_ZGVnN2v_cosf) \
   match(construct = {simd(notinbranch, simdlen(2))}, device =
{isa("simd")})
#pragma omp declare variant(_ZGVsMxv_cosf) \
   match(construct = {simd(inbranch)}, device = {isa("sve")})
extern float cosf (float);

The BETA ABI can be found in the vfabia64 subdir of
https://github.com/ARM-software/abi-aa/
This currently disagrees with how this patch series implements 'omp
declare simd' for SVE and I also do not see a need for the 'omp declare
variant' scalable extension constructs. I will make changes to the ABI
once we've finalized the co-design of the ABI and this implementation.


I don't see a good reason for dropping the extension("scalable").
The problem is that since the base spec requires a simdlen clause,
GCC should in general raise an error if simdlen is omitted.

Where can you find this in the specs? I tried to find it but couldn't.

Leaving out simdlen in an 'omp declare simd' I assume is OK; our vector 
ABI defines behaviour for this. But I couldn't find what it meant for an 
omp declare variant; it obviously can't be the same as for declare simd, 
as that is defined to mean 'define a set of clones' and only one clone 
can be associated with a declare variant.


I was going from https://www.openmp.org/spec-html/5.0/openmpsu25.html ,
which says:

   The simd trait can be further defined with properties that match the
   clauses accepted by the declare simd directive with the same name and
   semantics. The simd trait must define at least the simdlen property and
   one of the inbranch or notinbranch properties.

(probably best to read it in the original -- it's almost incomprehensible
without markup)

I'm guessing the keyword here is 'trait', which I'm guessing is different 
from an omp declare simd directive, which is why it's not required to 
have a simdlen clause in an omp declare simd (see Jakub's comment).


But for declare variants I guess it does require you to?  It doesn't 
'break' anything; it just means I need to add support for parsing the 
extension clause, as was originally planned.

Richard


RE: Re: [PATCH] RISC-V: Allow VMS{Compare} (V1, V1) shortcut optimization

2023-04-21 Thread Li, Pan2 via Gcc-patches
Kindly ping for the PATCH v2. Just FYI there will be some underlying 
investigation based on this PATCH like VMSEQ.

Pan

-Original Message-
From: Li, Pan2 
Sent: Wednesday, April 19, 2023 7:27 PM
To: 'Kito Cheng' ; 'juzhe.zh...@rivai.ai' 

Cc: 'gcc-patches' ; 'Kito.cheng' 
; Wang, Yanzhang 
Subject: RE: Re: [PATCH] RISC-V: Allow VMS{Compare} (V1, V1) shortcut 
optimization

Update the Patch v2 for more detail information for clarification. Please help 
to review continuously.

https://gcc.gnu.org/pipermail/gcc-patches/2023-April/616175.html

Pan

-Original Message-
From: Li, Pan2 
Sent: Wednesday, April 19, 2023 6:33 PM
To: Kito Cheng ; juzhe.zh...@rivai.ai
Cc: gcc-patches ; Kito.cheng ; 
Wang, Yanzhang 
Subject: RE: Re: [PATCH] RISC-V: Allow VMS{Compare} (V1, V1) shortcut 
optimization

Sure thing.

For the ChangeLog, I assumed it was generated automatically before. LOL.

Pan

-Original Message-
From: Kito Cheng  
Sent: Wednesday, April 19, 2023 5:46 PM
To: juzhe.zh...@rivai.ai
Cc: Li, Pan2 ; gcc-patches ; 
Kito.cheng ; Wang, Yanzhang 
Subject: Re: Re: [PATCH] RISC-V: Allow VMS{Compare} (V1, V1) shortcut 
optimization

Hi JuZhe:

Thanks for explaining!


Hi Pan:

I think it would be helpful if JuZhe's explanation could be written into 
the commit log.


> gcc/ChangeLog:
>
>* config/riscv/riscv-v.cc (emit_pred_op):
>* config/riscv/riscv-vector-builtins-bases.cc:
>* config/riscv/vector.md:

And don't forget to write something in the ChangeLog... :P


Re: [PATCH] MAINTAINERS: add Vineet Gupta to write after approval

2023-04-21 Thread Richard Sandiford via Gcc-patches
Palmer Dabbelt  writes:
> On Thu, 20 Apr 2023 09:55:23 PDT (-0700), Vineet Gupta wrote:
>> ChangeLog:
>>
>>  * MAINTAINERS (Write After Approval): Add myself.
>>
>> (Ref: <680c7bbe-5d6e-07cd-8468-247afc65e...@gmail.com>)
>>
>> Signed-off-by: Vineet Gupta 
>> ---
>>  MAINTAINERS | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index cebf45d49e56..5f25617212a5 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -434,6 +434,7 @@ Haochen Gui  
>> 
>>  Jiufu Guo   
>>  Xuepeng Guo 
>>  Wei Guozhi  
>> +Vineet Gupta
>>  Naveen H.S  
>>  Mostafa Hagog   
>>  Andrew Haley
>
> Acked-by: Palmer Dabbelt 
>
> Though not sure if I can do that, maybe we need a global reviewer?

No approval is needed when adding oneself to write-after-approval.
The fact that one's able to make the change is proof enough.

Richard


[PATCH V4] RISC-V: Defer vsetvli insertion to later if possible [PR108270]

2023-04-21 Thread juzhe . zhong
From: Juzhe-Zhong 

Fix issue: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108270.

Consider the following testcase:
void f (void * restrict in, void * restrict out, int l, int n, int m)
{
  for (int i = 0; i < l; i++){
for (int j = 0; j < m; j++){
  for (int k = 0; k < n; k++)
{
  vint8mf8_t v = __riscv_vle8_v_i8mf8 (in + i + j, 17);
  __riscv_vse8_v_i8mf8 (out + i + j, v, 17);
}
}
  }
}

Compile option: -O3

Before this patch:
mv  a7,a2
mv  a6,a0
mv  t1,a1
mv  a2,a3
vsetivli  zero,17,e8,mf8,ta,ma
ble a7,zero,.L1
ble a4,zero,.L1
ble a3,zero,.L1
...

After this patch:
mv  a7,a2
mv  a6,a0
mv  t1,a1
mv  a2,a3
ble a7,zero,.L1
ble a4,zero,.L1
ble a3,zero,.L1
add a1,a0,a4
li  a0,0
vsetivli  zero,17,e8,mf8,ta,ma
...

This issue is a missed optimization produced by Phase 3 (global backward
demand fusion) rather than by LCM.

This patch fixes poor placement of the vsetvl.

This point is selected not by LCM but by Phase 3 (VL/VTYPE demand info
backward fusion and propagation), which I introduced into the VSETVL pass
to enhance LCM and improve vsetvl instruction performance.

This patch suppresses Phase 3's too-aggressive backward fusion and
propagation to the top of the function when there is no defining
instruction of the AVL (the AVL is an immediate 0 ~ 31, since the
vsetivli instruction allows an immediate value instead of a register).

You may wonder why we need Phase 3 to do the job.
Well, there are many situations that pure LCM fails to optimize; here is
a simple case to demonstrate it:

void f (void * restrict in, void * restrict out, int n, int m, int cond)
{
  size_t vl = 101;
  for (size_t j = 0; j < m; j++){
if (cond) {
  for (size_t i = 0; i < n; i++)
{
  vint8mf8_t v = __riscv_vle8_v_i8mf8 (in + i + j, vl);
  __riscv_vse8_v_i8mf8 (out + i, v, vl);
}
} else {
  for (size_t i = 0; i < n; i++)
{
  vint32mf2_t v = __riscv_vle32_v_i32mf2 (in + i + j, vl);
  v = __riscv_vadd_vv_i32mf2 (v,v,vl);
  __riscv_vse32_v_i32mf2 (out + i, v, vl);
}
}
  }
}

You can see:
The first inner loop needs vsetvli e8 mf8 for vle+vse.
The second inner loop needs vsetvli e32 mf2 for vle+vadd+vse.

If we don't have Phase 3 (Only handled by LCM (Phase 4)), we will end up with :

outerloop:
...
vsetvli e8mf8
inner loop 1:


vsetvli e32mf2
inner loop 2:


However, if we have Phase 3, Phase 3 is going to fuse the vsetvli e32 mf2 of 
inner loop 2 into vsetvli e8 mf8, then we will end up with this result after 
phase 3:

outerloop:
...
inner loop 1:
vsetvli e32mf2


inner loop 2:
vsetvli e32mf2


Then, this demand information after phase 3 will be well optimized after phase 
4 (LCM), after Phase 4 result is:

vsetvli e32mf2
outerloop:
...
inner loop 1:


inner loop 2:


You can see this is the optimal codegen after the current VSETVL pass
(Phase 3: demand backward fusion and propagation + Phase 4: LCM).  This
has been a known issue since I started to implement the VSETVL pass.

PR 108270

gcc/ChangeLog:

* config/riscv/riscv-vsetvl.cc 
(vector_infos_manager::all_empty_predecessor_p): New function.
(pass_vsetvl::backward_demand_fusion): Ditto.
* config/riscv/riscv-vsetvl.h: Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/vsetvl/imm_bb_prop-1.c: Adapt testcase.
* gcc.target/riscv/rvv/vsetvl/imm_conflict-3.c: Ditto.
* gcc.target/riscv/rvv/vsetvl/pr108270.c: New test.

---
 gcc/config/riscv/riscv-vsetvl.cc  | 23 +++
 gcc/config/riscv/riscv-vsetvl.h   |  2 ++
 .../riscv/rvv/vsetvl/imm_bb_prop-1.c  |  2 +-
 .../riscv/rvv/vsetvl/imm_conflict-3.c |  4 ++--
 .../gcc.target/riscv/rvv/vsetvl/pr108270.c| 19 +++
 5 files changed, 47 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr108270.c

diff --git a/gcc/config/riscv/riscv-vsetvl.cc b/gcc/config/riscv/riscv-vsetvl.cc
index 5f424221659..167e3c6145c 100644
--- a/gcc/config/riscv/riscv-vsetvl.cc
+++ b/gcc/config/riscv/riscv-vsetvl.cc
@@ -2355,6 +2355,21 @@ vector_infos_manager::get_all_available_exprs (
   return available_list;
 }
 
+bool
+vector_infos_manager::all_empty_predecessor_p (const basic_block cfg_bb) const
+{
+  hash_set<basic_block> pred_cfg_bbs = get_all_predecessors (cfg_bb);
+  for (const basic_block pred_cfg_bb : pred_cfg_bbs)
+{
+  const auto &pred_block_info = vector_block_infos[pred_cfg_bb->index];
+  if (!pred_block_info.local_dem.valid_or_dirty_p ()
+ && !pred_block_info.reaching_out.valid_or_dirty_p ())
+   continue;
+  return false;
+}
+  return true;
+}
+
 bool
 vector_infos_manager::all_same_ratio_p (sbitmap bitdata) const
 {

Re: [aarch64] Use dup and zip1 for interleaving elements in initializing vector

2023-04-21 Thread Richard Sandiford via Gcc-patches
Prathamesh Kulkarni  writes:
> Hi,
> I tested the interleave+zip1 for vector init patch and it segfaulted
> during bootstrap while trying to build
> libgfortran/generated/matmul_i2.c.
> Rebuilding with --enable-checking=rtl showed out of bounds access in
> aarch64_unzip_vector_init in following hunk:
>
> +  rtvec vec = rtvec_alloc (n / 2);
> +  for (int i = 0; i < n; i++)
> +RTVEC_ELT (vec, i) = (even_p) ? XVECEXP (vals, 0, 2 * i)
> + : XVECEXP (vals, 0, 2 * i + 1);
>
> which is incorrect since it allocates n/2 but iterates and stores up to n.
> The attached patch fixes the issue, which passed bootstrap, however
> resulted in following fallout during testsuite run:
>
> 1] sve/acle/general/dupq_[1-4].c tests fail.
> For the following test:
> int32x4_t f(int32_t x)
> {
>   return (int32x4_t) { x, 1, 2, 3 };
> }
>
> Code-gen without patch:
> f:
> adrpx1, .LC0
> ldr q0, [x1, #:lo12:.LC0]
> ins v0.s[0], w0
> ret
>
> Code-gen with patch:
> f:
> moviv0.2s, 0x2
> adrpx1, .LC0
> ldr d1, [x1, #:lo12:.LC0]
> ins v0.s[0], w0
> zip1v0.4s, v0.4s, v1.4s
> ret
>
> It shows, fallback_seq_cost = 20, seq_total_cost = 16
> where seq_total_cost determines the cost for interleave+zip1 sequence
> and fallback_seq_cost is the cost for fallback sequence.
> Although it shows a lower cost, I am not sure if the interleave+zip1
> sequence is better in this case?

Debugging the patch, it looks like this is because the fallback sequence
contains a redundant pseudo-to-pseudo move, which is costed as 1
instruction (4 units).  The RTL equivalent of the:

 moviv0.2s, 0x2
 ins v0.s[0], w0

has a similar redundant move, but the cost of that move is subsumed by
the cost of the other arm (the load from LC0), which is costed as 3
instructions (12 units).  So we have 12 + 4 for the parallel version
(correct) but 12 + 4 + 4 for the serial version (one instruction too
many).

The reason we have redundant moves is that the expansion code uses
copy_to_mode_reg to force a value into a register.  This creates a
new pseudo even if the original value was already a register.
Using force_reg removes the moves and makes the test pass.

So I think the first step is to use force_reg instead of
copy_to_mode_reg in aarch64_simd_dup_constant and
aarch64_expand_vector_init (as a preparatory patch).
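
A minimal sketch of the idea (my illustration, not the actual patch):

  /* Was: rtx v = copy_to_mode_reg (mode, x);  -- this always creates a
     fresh pseudo and emits a move, so an X that is already a register
     gets a redundant pseudo-to-pseudo move, which the cost model then
     counts as an extra instruction.  */
  rtx v = force_reg (mode, x);  /* Returns X unchanged if it is already
                                   a register of the right mode.  */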

> 2] sve/acle/general/dupq_[5-6].c tests fail:
> int32x4_t f(int32_t x0, int32_t x1, int32_t x2, int32_t x3)
> {
>   return (int32x4_t) { x0, x1, x2, x3 };
> }
>
> code-gen without patch:
> f:
> fmovs0, w0
> ins v0.s[1], w1
> ins v0.s[2], w2
> ins v0.s[3], w3
> ret
>
> code-gen with patch:
> f:
> fmovs0, w0
> fmovs1, w1
> ins v0.s[1], w2
> ins v1.s[1], w3
> zip1v0.4s, v0.4s, v1.4s
> ret
>
> It shows fallback_seq_cost = 28, seq_total_cost = 16

The zip version still wins after the fix above, but by a smaller amount.
It seems like a borderline case.

>
> 3] aarch64/ldp_stp_16.c's cons2_8_float test fails.
> Test case:
> void cons2_8_float(float *x, float val0, float val1)
> {
> #pragma GCC unroll(8)
>   for (int i = 0; i < 8 * 2; i += 2) {
> x[i + 0] = val0;
> x[i + 1] = val1;
>   }
> }
>
> which is lowered to:
> void cons2_8_float (float * x, float val0, float val1)
> {
>   vector(4) float _86;
>
>[local count: 119292720]:
>   _86 = {val0_11(D), val1_13(D), val0_11(D), val1_13(D)};
>   MEM  [(float *)x_10(D)] = _86;
>   MEM  [(float *)x_10(D) + 16B] = _86;
>   MEM  [(float *)x_10(D) + 32B] = _86;
>   MEM  [(float *)x_10(D) + 48B] = _86;
>   return;
> }
>
> code-gen without patch:
> cons2_8_float:
> dup v0.4s, v0.s[0]
> ins v0.s[1], v1.s[0]
> ins v0.s[3], v1.s[0]
> stp q0, q0, [x0]
> stp q0, q0, [x0, 32]
> ret
>
> code-gen with patch:
> cons2_8_float:
> dup v1.2s, v1.s[0]
> dup v0.2s, v0.s[0]
> zip1v0.4s, v0.4s, v1.4s
> stp q0, q0, [x0]
> stp q0, q0, [x0, 32]
> ret
>
> It shows fallback_seq_cost = 28, seq_total_cost = 16
>
> I think the test fails because it doesn't match:
> **  dup v([0-9]+)\.4s, .*
>
> Would it be OK to amend the test, assuming the code-gen with the patch is better?

Yeah, the new code seems like an improvement.

> 4] aarch64/pr109072_1.c s32x4_3 test fails:
> For the following test:
> int32x4_t s32x4_3 (int32_t x, int32_t y)
> {
>   int32_t arr[] = { x, y, y, y };
>   return vld1q_s32 (arr);
> }
>
> code-gen without patch:
> s32x4_3:
> dup v0.4s, w1
> ins v0.s[0], w0
> ret
>
> code-gen with patch:
> s32x4_3:
> fmovs1, w1
> fmovs0, w0
> ins v0.s[1], v1.s[0]
> dup v1.2s, v1.s[0]
> zip1v0.4s, v0.4s, v1.4s
> ret
>
> It shows 

[PATCH V3] RISC-V: Defer vsetvli insertion to later if possible [PR108270]

2023-04-21 Thread juzhe . zhong
From: Juzhe-Zhong 

Fix issue: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108270.

Consider the following testcase:
void f (void * restrict in, void * restrict out, int l, int n, int m)
{
  for (int i = 0; i < l; i++){
for (int j = 0; j < m; j++){
  for (int k = 0; k < n; k++)
{
  vint8mf8_t v = __riscv_vle8_v_i8mf8 (in + i + j, 17);
  __riscv_vse8_v_i8mf8 (out + i + j, v, 17);
}
}
  }
}

Compile option: -O3

Before this patch:
mv  a7,a2
mv  a6,a0
mv  t1,a1
mv  a2,a3
vsetivli  zero,17,e8,mf8,ta,ma
ble a7,zero,.L1
ble a4,zero,.L1
ble a3,zero,.L1
...

After this patch:
mv  a7,a2
mv  a6,a0
mv  t1,a1
mv  a2,a3
ble a7,zero,.L1
ble a4,zero,.L1
ble a3,zero,.L1
add a1,a0,a4
li  a0,0
vsetivli  zero,17,e8,mf8,ta,ma
...

This issue is a missed optimization produced by Phase 3 (global backward
demand fusion) rather than by LCM.

This patch fixes poor placement of the vsetvl.

This point is selected not by LCM but by Phase 3 (VL/VTYPE demand info
backward fusion and propagation), which I introduced into the VSETVL pass
to enhance LCM and improve vsetvl instruction performance.

This patch suppresses Phase 3's too-aggressive backward fusion and
propagation to the top of the function when there is no defining
instruction of the AVL (the AVL is an immediate 0 ~ 31, since the
vsetivli instruction allows an immediate value instead of a register).

You may wonder why we need Phase 3 to do the job.
Well, there are many situations that pure LCM fails to optimize; here is
a simple case to demonstrate it:

void f (void * restrict in, void * restrict out, int n, int m, int cond)
{
  size_t vl = 101;
  for (size_t j = 0; j < m; j++){
if (cond) {
  for (size_t i = 0; i < n; i++)
{
  vint8mf8_t v = __riscv_vle8_v_i8mf8 (in + i + j, vl);
  __riscv_vse8_v_i8mf8 (out + i, v, vl);
}
} else {
  for (size_t i = 0; i < n; i++)
{
  vint32mf2_t v = __riscv_vle32_v_i32mf2 (in + i + j, vl);
  v = __riscv_vadd_vv_i32mf2 (v,v,vl);
  __riscv_vse32_v_i32mf2 (out + i, v, vl);
}
}
  }
}

You can see:
The first inner loop needs vsetvli e8 mf8 for vle+vse.
The second inner loop needs vsetvli e32 mf2 for vle+vadd+vse.

If we don't have Phase 3 (and everything is handled only by LCM, Phase 4), we will end up with:

outerloop:
...
vsetvli e8mf8
inner loop 1:


vsetvli e32mf2
inner loop 2:


However, if we have Phase 3, Phase 3 is going to fuse the vsetvli e32 mf2 of 
inner loop 2 into vsetvli e8 mf8, then we will end up with this result after 
phase 3:

outerloop:
...
inner loop 1:
vsetvli e32mf2


inner loop 2:
vsetvli e32mf2


This demand information after Phase 3 is then well optimized by Phase 4
(LCM); the result after Phase 4 is:

vsetvli e32mf2
outerloop:
...
inner loop 1:


inner loop 2:


You can see this is the optimal codegen with the current VSETVL PASS
(Phase 3: demand backward fusion and propagation + Phase 4: LCM).  This was
a known issue when I started to implement the VSETVL PASS.
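
The guard this patch adds can be sketched in a self-contained way as
follows (illustrative C++ only; the stand-in types are made up, the real
code is in the diff below):

/* Stand-ins for the real GCC classes, just enough to compile.  */
#include <vector>

struct demand_info { bool valid_or_dirty; };
struct block_info
{
  demand_info local_dem;
  demand_info reaching_out;
  std::vector<const block_info *> preds;
};

/* Phase 3 may propagate an immediate-AVL (0..31, a vsetivli candidate)
   demand to the top of the function only when every predecessor block
   carries no vector demand information at all.  */
static bool
all_empty_predecessor_p (const block_info &bb)
{
  for (const block_info *pred : bb.preds)
    if (pred->local_dem.valid_or_dirty || pred->reaching_out.valid_or_dirty)
      return false;
  return true;
}

The real implementation walks GCC's vector_block_infos array, as the diff
below shows.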

gcc/ChangeLog:

* config/riscv/riscv-vsetvl.cc 
(vector_infos_manager::all_empty_predecessor_p): New function.
(pass_vsetvl::backward_demand_fusion): Ditto.
* config/riscv/riscv-vsetvl.h: Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/vsetvl/imm_bb_prop-1.c: Adapt testcase.
* gcc.target/riscv/rvv/vsetvl/imm_conflict-3.c: Ditto.
* gcc.target/riscv/rvv/vsetvl/pr108270.c: New test.

---
 gcc/config/riscv/riscv-vsetvl.cc  | 23 +++
 gcc/config/riscv/riscv-vsetvl.h   |  2 ++
 .../riscv/rvv/vsetvl/imm_bb_prop-1.c  |  2 +-
 .../riscv/rvv/vsetvl/imm_conflict-3.c |  4 ++--
 .../gcc.target/riscv/rvv/vsetvl/pr108270.c| 19 +++
 5 files changed, 47 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr108270.c

diff --git a/gcc/config/riscv/riscv-vsetvl.cc b/gcc/config/riscv/riscv-vsetvl.cc
index 5f424221659..167e3c6145c 100644
--- a/gcc/config/riscv/riscv-vsetvl.cc
+++ b/gcc/config/riscv/riscv-vsetvl.cc
@@ -2355,6 +2355,21 @@ vector_infos_manager::get_all_available_exprs (
   return available_list;
 }
 
+bool
+vector_infos_manager::all_empty_predecessor_p (const basic_block cfg_bb) const
+{
+  hash_set<basic_block> pred_cfg_bbs = get_all_predecessors (cfg_bb);
+  for (const basic_block pred_cfg_bb : pred_cfg_bbs)
+{
+  const auto &pred_block_info = vector_block_infos[pred_cfg_bb->index];
+  if (!pred_block_info.local_dem.valid_or_dirty_p ()
+ && !pred_block_info.reaching_out.valid_or_dirty_p ())
+   continue;
+  return false;
+}
+  return true;
+}
+
 bool
 vector_infos_manager::all_same_ratio_p (sbitmap bitdata) const
 {
@@ -3138,6 +3153,14 

Re: [PATCH] update_web_docs_git: Add updated Texinfo to PATH

2023-04-21 Thread Arsen Arsenović via Gcc-patches

Gerald Pfeifer  writes:

> On Thu, 20 Apr 2023, Arsen Arsenović wrote:
>>> I understand, just am wondering whether and why the : is required? I 
>>> don't think we are using this construct anywhere else?
>> Without them, this would happen:
>> 
>>   ~$ "${foo:=foo}"
>>   bash: foo: command not found
>>   ~ 127 $ unset foo
>>   ~$ echo "${foo:=foo}"
>>   foo
>>   ~$ 
>
> Ah, of course!
>
> That's why I tend to use FOO=${FOO-barbar} in such cases - which is a tad 
> more characters. :)
>
>> Thank you!  Hopefully we get this just in time for 13 :)
>
> The release is currently planned for the 26th and the updated script is 
> now live.

Perfect \o/

> I just ran it and things seem to work just fine. Do you spot anything
> unexpected?

Seems perfect, thank you!

Have a lovely day!  :)

> Gerald


-- 
Arsen Arsenović




Re: [RFA] [PR target/108248] [RISC-V] Break down some bitmanip insn types

2023-04-21 Thread Kito Cheng via Gcc-patches
Hi Robin:

OK, Feel free to commit that to trunk.

and don't forget to mention this:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109582


On Fri, Apr 21, 2023 at 3:45 PM Robin Dapp via Gcc-patches
 wrote:
>
> > ../../gcc/config/riscv/generic.md:28:1: unknown value `smin' for attribute 
> > `type'
> > make[3]: *** [Makefile:2528: s-attrtab] Error 1
> >
>
> From 582c428258ce17ffac8ef1b96b4072f3d510480f Mon Sep 17 00:00:00 2001
> From: Robin Dapp 
> Date: Fri, 21 Apr 2023 09:38:06 +0200
> Subject: [PATCH] riscv: Fix  fallout.
>
> The adjusted generic.md uses standard names instead of the types defined
> in the  iterator (that match instruction names).  Change this.
>
> gcc/ChangeLog:
>
> * config/riscv/generic.md: Change standard names to insn names.
> ---
>  gcc/config/riscv/generic.md | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/gcc/config/riscv/generic.md b/gcc/config/riscv/generic.md
> index db4fabbbd92..2c3376628c3 100644
> --- a/gcc/config/riscv/generic.md
> +++ b/gcc/config/riscv/generic.md
> @@ -27,7 +27,7 @@ (define_cpu_unit "fdivsqrt" "pipe0")
>
>  (define_insn_reservation "generic_alu" 1
>(and (eq_attr "tune" "generic")
> -   (eq_attr "type" "unknown,const,arith,shift,slt,multi,auipc,nop,logical,move,bitmanip,smin,smax,umin,umax,clz,ctz,cpop"))
> +   (eq_attr "type" "unknown,const,arith,shift,slt,multi,auipc,nop,logical,move,bitmanip,min,max,minu,maxu,clz,ctz,cpop"))
>"alu")
>
>  (define_insn_reservation "generic_load" 3
> --
> 2.40.0
>
>


[PATCH] Fix LCM dataflow CFG order

2023-04-21 Thread Richard Biener via Gcc-patches
The following fixes the initial order in which the LCM dataflow routines
process BBs.  For a forward problem you want reverse postorder; for a
backward problem you want reverse postorder on the inverted graph.

The LCM iteration has very many other issues, but this makes it easier to
turn inverted_post_order_compute into computing a reverse postorder.
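
To see why the seeding order matters, here is a self-contained sketch of
the kind of iterative solver lcm.cc implements (illustrative only, not the
actual GCC code; out is assumed pre-initialized to all-ones by the caller):

#include <deque>
#include <vector>

/* Forward problem with optimistic all-ones initialization and an identity
   transfer function.  Seeding the worklist in reverse postorder means most
   predecessors are processed before their successors, so few blocks ever
   need to be revisited.  */
static void
solve (const std::vector<std::vector<int>> &preds,
       const std::vector<std::vector<int>> &succs,
       const std::vector<int> &rpo, std::vector<unsigned> &out)
{
  std::deque<int> work (rpo.begin (), rpo.end ());
  while (!work.empty ())
    {
      int b = work.front ();
      work.pop_front ();
      unsigned in = ~0u;
      for (int p : preds[b])
        in &= out[p];              /* meet over predecessors */
      if (in != out[b])
        {
          out[b] = in;
          for (int s : succs[b])
            work.push_back (s);    /* revisit affected successors */
        }
    }
}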

Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.

* lcm.cc (compute_antinout_edge): Use RPO on the inverted graph.
(compute_laterin): Use RPO.
(compute_available): Likewise.
---
 gcc/lcm.cc | 47 ---
 1 file changed, 24 insertions(+), 23 deletions(-)

diff --git a/gcc/lcm.cc b/gcc/lcm.cc
index d7a86c75cd9..5adb4eb1a11 100644
--- a/gcc/lcm.cc
+++ b/gcc/lcm.cc
@@ -99,16 +99,20 @@ compute_antinout_edge (sbitmap *antloc, sbitmap *transp, sbitmap *antin,
   bitmap_vector_ones (antin, last_basic_block_for_fn (cfun));
 
   /* Put every block on the worklist; this is necessary because of the
- optimistic initialization of ANTIN above.  */
-  int *postorder = XNEWVEC (int, n_basic_blocks_for_fn (cfun));
-  int postorder_num = post_order_compute (postorder, false, false);
-  for (int i = 0; i < postorder_num; ++i)
+ optimistic initialization of ANTIN above.  Use reverse postorder
+ on the inverted graph to make the backward dataflow problem require
+ less iterations.  */
+  auto_vec<int> postorder;
+  inverted_post_order_compute (&postorder);
+  for (int i = postorder.length () - 1; i >= 0; --i)
 {
   bb = BASIC_BLOCK_FOR_FN (cfun, postorder[i]);
+  if (bb == EXIT_BLOCK_PTR_FOR_FN (cfun)
+ || bb == ENTRY_BLOCK_PTR_FOR_FN (cfun))
+   continue;
   *qin++ = bb;
   bb->aux = bb;
 }
-  free (postorder);
 
   qin = worklist;
  qend = &worklist[n_basic_blocks_for_fn (cfun) - NUM_FIXED_BLOCKS];
@@ -270,17 +274,15 @@ compute_laterin (struct edge_list *edge_list, sbitmap *earliest,
 
   /* Add all the blocks to the worklist.  This prevents an early exit from
  the loop given our optimistic initialization of LATER above.  */
-  auto_vec<int> postorder;
-  inverted_post_order_compute (&postorder);
-  for (unsigned int i = 0; i < postorder.length (); ++i)
+  int *rpo = XNEWVEC (int, n_basic_blocks_for_fn (cfun) - NUM_FIXED_BLOCKS);
+  int n = pre_and_rev_post_order_compute_fn (cfun, NULL, rpo, false);
+  for (int i = 0; i < n; ++i)
 {
-  bb = BASIC_BLOCK_FOR_FN (cfun, postorder[i]);
-  if (bb == EXIT_BLOCK_PTR_FOR_FN (cfun)
- || bb == ENTRY_BLOCK_PTR_FOR_FN (cfun))
-   continue;
+  bb = BASIC_BLOCK_FOR_FN (cfun, rpo[i]);
   *qin++ = bb;
   bb->aux = bb;
 }
+  free (rpo);
 
   /* Note that we do not use the last allocated element for our queue,
  as EXIT_BLOCK is never inserted into it. */
@@ -298,13 +300,14 @@ compute_laterin (struct edge_list *edge_list, sbitmap *earliest,
   if (qout >= qend)
qout = worklist;
 
-  /* Compute the intersection of LATERIN for each incoming edge to B.  */
+  /* Compute LATERIN as the intersection of LATER for each incoming
+edge to BB.  */
   bitmap_ones (laterin[bb->index]);
   FOR_EACH_EDGE (e, ei, bb->preds)
bitmap_and (laterin[bb->index], laterin[bb->index],
later[(size_t)e->aux]);
 
-  /* Calculate LATER for all outgoing edges.  */
+  /* Calculate LATER for all outgoing edges of BB.  */
   FOR_EACH_EDGE (e, ei, bb->succs)
if (bitmap_ior_and_compl (later[(size_t) e->aux],
  earliest[(size_t) e->aux],
@@ -509,19 +512,17 @@ compute_available (sbitmap *avloc, sbitmap *kill, sbitmap *avout,
   bitmap_vector_ones (avout, last_basic_block_for_fn (cfun));
 
   /* Put every block on the worklist; this is necessary because of the
- optimistic initialization of AVOUT above.  Use inverted postorder
- to make the dataflow problem require less iterations.  */
-  auto_vec<int> postorder;
-  inverted_post_order_compute (&postorder);
-  for (unsigned int i = 0; i < postorder.length (); ++i)
+ optimistic initialization of AVOUT above.  Use reverse postorder
+ to make the forward dataflow problem require less iterations.  */
+  int *rpo = XNEWVEC (int, n_basic_blocks_for_fn (cfun) - NUM_FIXED_BLOCKS);
+  int n = pre_and_rev_post_order_compute_fn (cfun, NULL, rpo, false);
+  for (int i = 0; i < n; ++i)
 {
-  bb = BASIC_BLOCK_FOR_FN (cfun, postorder[i]);
-  if (bb == EXIT_BLOCK_PTR_FOR_FN (cfun)
- || bb == ENTRY_BLOCK_PTR_FOR_FN (cfun))
-   continue;
+  bb = BASIC_BLOCK_FOR_FN (cfun, rpo[i]);
   *qin++ = bb;
   bb->aux = bb;
 }
+  free (rpo);
 
   qin = worklist;
  qend = &worklist[n_basic_blocks_for_fn (cfun) - NUM_FIXED_BLOCKS];
-- 
2.35.3


Re: [RFA] [PR target/108248] [RISC-V] Break down some bitmanip insn types

2023-04-21 Thread Robin Dapp via Gcc-patches
> ../../gcc/config/riscv/generic.md:28:1: unknown value `smin' for attribute 
> `type'
> make[3]: *** [Makefile:2528: s-attrtab] Error 1
> 

From 582c428258ce17ffac8ef1b96b4072f3d510480f Mon Sep 17 00:00:00 2001
From: Robin Dapp 
Date: Fri, 21 Apr 2023 09:38:06 +0200
Subject: [PATCH] riscv: Fix  fallout.

The adjusted generic.md uses standard names instead of the types defined
in the  iterator (that match instruction names).  Change this.

gcc/ChangeLog:

* config/riscv/generic.md: Change standard names to insn names.
---
 gcc/config/riscv/generic.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/riscv/generic.md b/gcc/config/riscv/generic.md
index db4fabbbd92..2c3376628c3 100644
--- a/gcc/config/riscv/generic.md
+++ b/gcc/config/riscv/generic.md
@@ -27,7 +27,7 @@ (define_cpu_unit "fdivsqrt" "pipe0")
 
 (define_insn_reservation "generic_alu" 1
   (and (eq_attr "tune" "generic")
-   (eq_attr "type" "unknown,const,arith,shift,slt,multi,auipc,nop,logical,move,bitmanip,smin,smax,umin,umax,clz,ctz,cpop"))
+   (eq_attr "type" "unknown,const,arith,shift,slt,multi,auipc,nop,logical,move,bitmanip,min,max,minu,maxu,clz,ctz,cpop"))
   "alu")
 
 (define_insn_reservation "generic_load" 3
-- 
2.40.0




Re: [RFA] [PR target/108248] [RISC-V] Break down some bitmanip insn types

2023-04-21 Thread Andreas Schwab
../../gcc/config/riscv/generic.md:28:1: unknown value `smin' for attribute 
`type'
make[3]: *** [Makefile:2528: s-attrtab] Error 1

-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."


Re: [pushed][PATCH] LoongArch: fix MUSL_DYNAMIC_LINKER

2023-04-21 Thread Lulu Cheng

Pushed to r14-130.

On 2023/4/19 at 4:23 PM, Peng Fan wrote:

Systems based on musl have no '/lib64', so change it.

The "Multilib/multi-arch" section of
https://wiki.musl-libc.org/guidelines-for-distributions.html
describes this layout.

gcc/
 * config/loongarch/gnu-user.h (MUSL_DYNAMIC_LINKER): Redefine.

Signed-off-by: Peng Fan 
Suggested-by: Xi Ruoyao 
---
  gcc/config/loongarch/gnu-user.h | 7 ++-
  1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/gcc/config/loongarch/gnu-user.h b/gcc/config/loongarch/gnu-user.h
index aecaa02a199..fa1a5211419 100644
--- a/gcc/config/loongarch/gnu-user.h
+++ b/gcc/config/loongarch/gnu-user.h
@@ -33,9 +33,14 @@ along with GCC; see the file COPYING3.  If not see
  #define GLIBC_DYNAMIC_LINKER \
"/lib" ABI_GRLEN_SPEC "/ld-linux-loongarch-" ABI_SPEC ".so.1"
  
+#define MUSL_ABI_SPEC \
+  "%{mabi=lp64d:-lp64d}" \
+  "%{mabi=lp64f:-lp64f}" \
+  "%{mabi=lp64s:-lp64s}"
+
  #undef MUSL_DYNAMIC_LINKER
  #define MUSL_DYNAMIC_LINKER \
-  "/lib" ABI_GRLEN_SPEC "/ld-musl-loongarch-" ABI_SPEC ".so.1"
+  "/lib/ld-musl-loongarch" ABI_GRLEN_SPEC MUSL_ABI_SPEC ".so.1"
  
  #undef GNU_USER_TARGET_LINK_SPEC

  #define GNU_USER_TARGET_LINK_SPEC \




Re: [aarch64] Use dup and zip1 for interleaving elements in initializing vector

2023-04-21 Thread Prathamesh Kulkarni via Gcc-patches
On Wed, 12 Apr 2023 at 14:29, Richard Sandiford
 wrote:
>
> Prathamesh Kulkarni  writes:
> > On Thu, 6 Apr 2023 at 16:05, Richard Sandiford
> >  wrote:
> >>
> >> Prathamesh Kulkarni  writes:
> >> > On Tue, 4 Apr 2023 at 23:35, Richard Sandiford
> >> >  wrote:
> >> >> > diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
> >> >> > b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> >> >> > index cd9cace3c9b..3de79060619 100644
> >> >> > --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> >> >> > +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> >> >> > @@ -817,6 +817,62 @@ public:
> >> >> >
> >> >> >  class svdupq_impl : public quiet
> >> >> >  {
> >> >> > +private:
> >> >> > +  gimple *
> >> >> > +  fold_nonconst_dupq (gimple_folder &f, unsigned factor) const
> >> >> > +  {
> >> >> > +/* Lower lhs = svdupq (arg0, arg1, ..., argN) into:
> >> >> > +   tmp = {arg0, arg1, ..., argN}
> >> >> > +   lhs = VEC_PERM_EXPR (tmp, tmp, {0, 1, 2, N-1, ...})  */
> >> >> > +
> >> >> > +/* TODO: Revisit to handle factor by padding zeros.  */
> >> >> > +if (factor > 1)
> >> >> > +  return NULL;
> >> >>
> >> >> Isn't the key thing here predicate vs. vector rather than factor == 1 
> >> >> vs.
> >> >> factor != 1?  Do we generate good code for b8, where factor should be 1?
> >> > Hi,
> >> > It generates the following code for svdup_n_b8:
> >> > https://pastebin.com/ypYt590c
> >>
> >> Hmm, yeah, not pretty :-)  But it's not pretty without either.
> >>
> >> > I suppose lowering to ctor+vec_perm_expr is not really useful
> >> > for this case because it won't simplify ctor, unlike the above case of
> >> > svdupq_s32 (x[0], x[1], x[2], x[3]);
> >> > However I wonder if it's still a good idea to lower svdupq for 
> >> > predicates, for
> >> > representing svdupq (or other intrinsics) using GIMPLE constructs as
> >> > far as possible ?
> >>
> >> It's possible, but I think we'd need an example in which its a clear
> >> benefit.
> > Sorry I posted for wrong test case above.
> > For the following test:
> > svbool_t f(uint8x16_t x)
> > {
> >   return svdupq_n_b8 (x[0], x[1], x[2], x[3], x[4], x[5], x[6], x[7],
> > x[8], x[9], x[10], x[11], x[12],
> > x[13], x[14], x[15]);
> > }
> >
> > Code-gen:
> > https://pastebin.com/maexgeJn
> >
> > I suppose it's equivalent to following ?
> >
> > svbool_t f2(uint8x16_t x)
> > {
> >   svuint8_t tmp = svdupq_n_u8 ((bool) x[0], (bool) x[1], (bool) x[2],
> > (bool) x[3],
> >(bool) x[4], (bool) x[5], (bool) x[6],
> > (bool) x[7],
> >(bool) x[8], (bool) x[9], (bool) x[10],
> > (bool) x[11],
> >(bool) x[12], (bool) x[13], (bool)
> > x[14], (bool) x[15]);
> >   return svcmpne_n_u8 (svptrue_b8 (), tmp, 0);
> > }
>
> Yeah, this is essentially the transformation that the svdupq rtl
> expander uses.  It would probably be a good idea to do that in
> gimple too.
Hi,
I tested the interleave+zip1 vector-init patch and it segfaulted
during bootstrap while building libgfortran/generated/matmul_i2.c.
Rebuilding with --enable-checking=rtl showed an out-of-bounds access in
aarch64_unzip_vector_init in the following hunk:

+  rtvec vec = rtvec_alloc (n / 2);
+  for (int i = 0; i < n; i++)
+RTVEC_ELT (vec, i) = (even_p) ? XVECEXP (vals, 0, 2 * i)
+ : XVECEXP (vals, 0, 2 * i + 1);

which is incorrect since it allocates n/2 elements but iterates and stores up to n.
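
The fix amounts to iterating only n/2 times; as a sketch reconstructed
from that description (the actual attached patch may differ in details):

+  rtvec vec = rtvec_alloc (n / 2);
+  for (int i = 0; i < n / 2; i++)
+    RTVEC_ELT (vec, i) = even_p ? XVECEXP (vals, 0, 2 * i)
+                                : XVECEXP (vals, 0, 2 * i + 1);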
The attached patch fixes the issue and passed bootstrap; however, it
resulted in the following fallout during the testsuite run:

1] sve/acle/general/dupq_[1-4].c tests fail.
For the following test:
int32x4_t f(int32_t x)
{
  return (int32x4_t) { x, 1, 2, 3 };
}

Code-gen without patch:
f:
adrp x1, .LC0
ldr q0, [x1, #:lo12:.LC0]
ins v0.s[0], w0
ret

Code-gen with patch:
f:
movi v0.2s, 0x2
adrp x1, .LC0
ldr d1, [x1, #:lo12:.LC0]
ins v0.s[0], w0
zip1 v0.4s, v0.4s, v1.4s
ret

It shows fallback_seq_cost = 20, seq_total_cost = 16,
where seq_total_cost is the cost of the interleave+zip1 sequence
and fallback_seq_cost is the cost of the fallback sequence.
Although it shows a lower cost, I am not sure if the interleave+zip1
sequence is better in this case?

2] sve/acle/general/dupq_[5-6].c tests fail:
int32x4_t f(int32_t x0, int32_t x1, int32_t x2, int32_t x3)
{
  return (int32x4_t) { x0, x1, x2, x3 };
}

code-gen without patch:
f:
fmov s0, w0
ins v0.s[1], w1
ins v0.s[2], w2
ins v0.s[3], w3
ret

code-gen with patch:
f:
fmov s0, w0
fmov s1, w1
ins v0.s[1], w2
ins v1.s[1], w3
zip1 v0.4s, v0.4s, v1.4s
ret

It shows fallback_seq_cost = 28, seq_total_cost = 16

3] aarch64/ldp_stp_16.c's cons2_8_float test fails.
Test case:
void cons2_8_float(float *x, float 

Re: [PATCH] update_web_docs_git: Add updated Texinfo to PATH

2023-04-21 Thread Gerald Pfeifer
On Thu, 20 Apr 2023, Arsen Arsenović wrote:
>> I understand, just am wondering whether and why the : is required? I 
>> don't think we are using this construct anywhere else?
> Without them, this would happen:
> 
>   ~$ "${foo:=foo}"
>   bash: foo: command not found
>   ~ 127 $ unset foo
>   ~$ echo "${foo:=foo}"
>   foo
>   ~$ 

Ah, of course!

That's why I tend to use FOO=${FOO-barbar} in such cases - which is a tad 
more characters. :)

> Thank you!  Hopefully we get this just in time for 13 :)

The release is currently planned for the 26th and the updated script is 
now live.

I just ran it and things seem to work just fine. Do you spot anything
unexpected?

Gerald


[PATCH] RISC-V: Support segment intrinsics

2023-04-21 Thread juzhe.zhong
From: Juzhe-Zhong 

gcc/ChangeLog:

* config/riscv/riscv-vector-builtins-bases.cc (fold_fault_load): New 
function.
(class vlseg): New class.
(class vsseg): Ditto.
(class vlsseg): Ditto.
(class vssseg): Ditto.
(class seg_indexed_load): Ditto.
(class seg_indexed_store): Ditto.
(class vlsegff): Ditto.
(BASE): Ditto.
* config/riscv/riscv-vector-builtins-bases.h: Ditto.
* config/riscv/riscv-vector-builtins-functions.def (vlseg): Ditto.
(vsseg): Ditto.
(vlsseg): Ditto.
(vssseg): Ditto.
(vluxseg): Ditto.
(vloxseg): Ditto.
(vsuxseg): Ditto.
(vsoxseg): Ditto.
(vlsegff): Ditto.
* config/riscv/riscv-vector-builtins-shapes.cc (struct 
seg_loadstore_def): Ditto.
(struct seg_indexed_loadstore_def): Ditto.
(struct seg_fault_load_def): Ditto.
(SHAPE): Ditto.
* config/riscv/riscv-vector-builtins-shapes.h: Ditto.
* config/riscv/riscv-vector-builtins.cc (function_builder::append_nf): 
New function.
* config/riscv/riscv-vector-builtins.def (vfloat32m1x2_t): Change ptr 
from double into float.
(vfloat32m1x3_t): Ditto.
(vfloat32m1x4_t): Ditto.
(vfloat32m1x5_t): Ditto.
(vfloat32m1x6_t): Ditto.
(vfloat32m1x7_t): Ditto.
(vfloat32m1x8_t): Ditto.
(vfloat32m2x2_t): Ditto.
(vfloat32m2x3_t): Ditto.
(vfloat32m2x4_t): Ditto.
(vfloat32m4x2_t): Ditto.
* config/riscv/riscv-vector-builtins.h: Add segment intrinsics.
* config/riscv/riscv-vsetvl.cc (fault_first_load_p): Adapt for segment 
ff load.
* config/riscv/riscv.md: Add segment instructions.
* config/riscv/vector-iterators.md: Support segment intrinsics.
* config/riscv/vector.md (@pred_unit_strided_load): New pattern.
(@pred_unit_strided_store): Ditto.
(@pred_strided_load): Ditto.
(@pred_strided_store): Ditto.
(@pred_fault_load): Ditto.
(@pred_indexed_load): Ditto.
(@pred_indexed_load): Ditto.
(@pred_indexed_load): Ditto.
(@pred_indexed_load): Ditto.
(@pred_indexed_load): Ditto.
(@pred_indexed_load): Ditto.
(@pred_indexed_load): Ditto.
(@pred_indexed_store): Ditto.
(@pred_indexed_store): Ditto.
(@pred_indexed_store): Ditto.
(@pred_indexed_store): Ditto.
(@pred_indexed_store): Ditto.
(@pred_indexed_store): Ditto.
(@pred_indexed_store): Ditto.

---
 .../riscv/riscv-vector-builtins-bases.cc  | 270 +++--
 .../riscv/riscv-vector-builtins-bases.h   |   9 +
 .../riscv/riscv-vector-builtins-functions.def |  21 +
 .../riscv/riscv-vector-builtins-shapes.cc | 139 +
 .../riscv/riscv-vector-builtins-shapes.h  |   3 +
 gcc/config/riscv/riscv-vector-builtins.cc | 136 +
 gcc/config/riscv/riscv-vector-builtins.def|  22 +-
 gcc/config/riscv/riscv-vector-builtins.h  |   1 +
 gcc/config/riscv/riscv-vsetvl.cc  |   4 +-
 gcc/config/riscv/riscv.md |  10 +
 gcc/config/riscv/vector-iterators.md  | 280 +
 gcc/config/riscv/vector.md| 547 --
 12 files changed, 1324 insertions(+), 118 deletions(-)

diff --git a/gcc/config/riscv/riscv-vector-builtins-bases.cc 
b/gcc/config/riscv/riscv-vector-builtins-bases.cc
index 8693b2887fb..ab5b4dc9515 100644
--- a/gcc/config/riscv/riscv-vector-builtins-bases.cc
+++ b/gcc/config/riscv/riscv-vector-builtins-bases.cc
@@ -58,6 +58,54 @@ enum lst_type
   LST_INDEXED,
 };
 
+/* Helper function to fold vleff and vlsegff.  */
+static gimple *
+fold_fault_load (gimple_folder &f)
+{
+  /* fold fault_load (const *base, size_t *new_vl, size_t vl)
+
+ ====> fault_load (const *base, size_t vl)
+  new_vl = MEM_REF[read_vl ()].  */
+
+  auto_vec<tree> vargs (gimple_call_num_args (f.call) - 1);
+
+  for (unsigned i = 0; i < gimple_call_num_args (f.call); i++)
+{
+  /* Exclude size_t *new_vl argument.  */
+  if (i == gimple_call_num_args (f.call) - 2)
+   continue;
+
+  vargs.quick_push (gimple_call_arg (f.call, i));
+}
+
+  gimple *repl = gimple_build_call_vec (gimple_call_fn (f.call), vargs);
+  gimple_call_set_lhs (repl, f.lhs);
+
+  /* Handle size_t *new_vl by read_vl.  */
+  tree new_vl = gimple_call_arg (f.call, gimple_call_num_args (f.call) - 2);
+  if (integer_zerop (new_vl))
+{
+  /* This case happens when user passes the nullptr to new_vl argument.
+In this case, we just need to ignore the new_vl argument and return
+fault_load instruction directly. */
+  return repl;
+}
+
+  tree tmp_var = create_tmp_var (size_type_node, "new_vl");
+  tree decl = get_read_vl_decl ();
+  gimple *g = gimple_build_call (decl, 0);
+  gimple_call_set_lhs (g, tmp_var);
+  tree indirect
+= fold_build2 (MEM_REF, size_type_node,
+  

Re: [PATCH] Implement range-op entry for sin/cos.

2023-04-21 Thread Jakub Jelinek via Gcc-patches
On Thu, Apr 20, 2023 at 09:14:10PM -0400, Siddhesh Poyarekar wrote:
> On 2023-04-20 13:57, Siddhesh Poyarekar wrote:
> > For bounds that aren't representable, one could get error bounds from
> > libm-test-ulps data in glibc, although I reckon those won't be
> > exhaustive.  From a quick peek at the sin/cos data, the arc target seems
> > to be among the worst performers at about 7ulps, although if you include
> > the complex routines we get close to 13 ulps.  The very worst
> > imprecision among all math routines (that's gamma) is at 16 ulps for
> > power in glibc tests, so maybe allowing about 25-30 ulps error in bounds
> > might work across the board.
> 
> I was thinking about this a bit more and it seems like we should limit
> ranges to targets that can generate sane results (i.e. error bounds
> within, say, 5-6 ulps) and, for the rest, avoid emitting the ranges
> altogether.  Emitting a bad range for all architectures seems like a net
> worse solution again.

Well, at least for basic arithmetics when libm functions aren't involved,
there is no point in disabling ranges altogether.
And, for libm functions, my plan was to introduce a target hook, which
would have a combined_fn argument to tell which function is queried, a
machine_mode to say which floating-point format, and perhaps a bool for
whether it is the ulps on these basic math boundaries or for results
somewhere in between, and which would return the precision as an unsigned
int in ulps, 0 meaning 0.5ulp precision.
So, we could say for CASE_CFN_SIN: CASE_CFN_COS: in the glibc handler
say that ulps is say 3 inside of the ranges and 0 on the boundaries if
!flag_rounding_math and 6 and 2 with flag_rounding_math or whatever.
And in the generic code don't assume anything if ulps is say 100 or more.
The hook would need to return a union of the precision of the supported
versions of the library through history, including, say, libmvec, because
function calls could be vectorized.
And the default could be infinite precision.
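
To make that concrete, here is a hypothetical sketch of such a hook
(illustrative only; the hook does not exist at the time of writing, the
stand-in enums are made up, and the ulps numbers just echo the examples
above):

/* Stand-ins for GCC's internal types, only for illustration.  */
enum combined_fn { CFN_SIN, CFN_COS, CFN_LAST };
enum machine_mode { SFmode, DFmode };

/* Return the maximum known error of CFN in MODE in ulps, 0 meaning
   correctly rounded (0.5ulp).  BOUNDARY_P selects the error on the basic
   math boundaries rather than for results in between.  */
unsigned int
libm_function_max_error (combined_fn cfn, machine_mode /* mode */,
                         bool boundary_p)
{
  switch (cfn)
    {
    case CFN_SIN:
    case CFN_COS:
      return boundary_p ? 0 : 3;  /* the glibc-ish numbers from above */
    default:
      return ~0u;                 /* >= 100: generic code assumes nothing */
    }
}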
Back in November I posted a proglet that can generate some ulps from
random-number testing; plus, on glibc we could pick maximums from the ulps files.
And if needed, say powerpc*-linux could override the generic glibc
version for some subset of functions and call default otherwise (say at
least for __ibm128).

Jakub



Re: [committed v2] RISC-V: Add local user vsetvl instruction elimination [PR109547]

2023-04-21 Thread juzhe.zh...@rivai.ai
LGTM.



juzhe.zh...@rivai.ai
 
From: Kito Cheng
Date: 2023-04-21 14:49
To: gcc-patches
CC: Juzhe-Zhong
Subject: [committed v2] RISC-V: Add local user vsetvl instruction elimination 
[PR109547]
From: Juzhe-Zhong 
 
This patch enhances optimization for auto-vectorization.
 
Before this patch:
 
Loop:
vsetvl a5,a2...
vsetvl zero,a5...
vle
 
After this patch:
 
Loop:
vsetvl a5,a2
vle
 
gcc/ChangeLog:
 
PR target/109547
* config/riscv/riscv-vsetvl.cc (local_eliminate_vsetvl_insn): New function.
(vector_insn_info::skip_avl_compatible_p): Ditto.
(vector_insn_info::merge): Remove default value.
(pass_vsetvl::compute_local_backward_infos): Ditto.
(pass_vsetvl::cleanup_insns): Add local vsetvl elimination.
* config/riscv/riscv-vsetvl.h: Ditto.
 
gcc/testsuite/ChangeLog:
 
PR target/109547
* gcc.target/riscv/rvv/vsetvl/pr109547.c: New.
* gcc.target/riscv/rvv/vsetvl/vsetvl-17.c: Update scan
condition.
---
gcc/config/riscv/riscv-vsetvl.cc  | 71 ++-
gcc/config/riscv/riscv-vsetvl.h   |  1 +
.../gcc.target/riscv/rvv/vsetvl/pr109547.c| 14 
.../gcc.target/riscv/rvv/vsetvl/vsetvl-17.c   |  2 +-
4 files changed, 85 insertions(+), 3 deletions(-)
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr109547.c
 
diff --git a/gcc/config/riscv/riscv-vsetvl.cc b/gcc/config/riscv/riscv-vsetvl.cc
index 9c356ce51579..2406931dac01 100644
--- a/gcc/config/riscv/riscv-vsetvl.cc
+++ b/gcc/config/riscv/riscv-vsetvl.cc
@@ -1054,6 +1054,51 @@ change_vsetvl_insn (const insn_info *insn, const vector_insn_info &info)
   change_insn (rinsn, new_pat);
}
+static void
+local_eliminate_vsetvl_insn (const vector_insn_info &dem)
+{
+  const insn_info *insn = dem.get_insn ();
+  if (!insn || insn->is_artificial ())
+return;
+  rtx_insn *rinsn = insn->rtl ();
+  const bb_info *bb = insn->bb ();
+  if (vsetvl_insn_p (rinsn))
+{
+  rtx vl = get_vl (rinsn);
+  for (insn_info *i = insn->next_nondebug_insn ();
+real_insn_and_same_bb_p (i, bb); i = i->next_nondebug_insn ())
+ {
+   if (i->is_call () || i->is_asm ()
+   || find_access (i->defs (), VL_REGNUM)
+   || find_access (i->defs (), VTYPE_REGNUM))
+ return;
+
+   if (has_vtype_op (i->rtl ()))
+ {
+   if (!vsetvl_discard_result_insn_p (PREV_INSN (i->rtl ())))
+ return;
+   rtx avl = get_avl (i->rtl ());
+   if (avl != vl)
+ return;
+   set_info *def = find_access (i->uses (), REGNO (avl))->def ();
+   if (def->insn () != insn)
+ return;
+
+   vector_insn_info new_info;
+   new_info.parse_insn (i);
+   if (!new_info.skip_avl_compatible_p (dem))
+ return;
+
+   new_info.set_avl_info (dem.get_avl_info ());
+   new_info = dem.merge (new_info, LOCAL_MERGE);
+   change_vsetvl_insn (insn, new_info);
+   eliminate_insn (PREV_INSN (i->rtl ()));
+   return;
+ }
+ }
+}
+}
+
static bool
source_equal_p (insn_info *insn1, insn_info *insn2)
{
@@ -1996,6 +2041,19 @@ vector_insn_info::compatible_p (const vector_insn_info &other) const
   return true;
}
+bool
+vector_insn_info::skip_avl_compatible_p (const vector_insn_info &other) const
+{
+  gcc_assert (valid_or_dirty_p () && other.valid_or_dirty_p ()
+   && "Can't compare invalid demanded infos");
+  unsigned array_size = sizeof (incompatible_conds) / sizeof (demands_cond);
+  /* Bypass AVL incompatible cases.  */
+  for (unsigned i = 1; i < array_size; i++)
+if (incompatible_conds[i].dual_incompatible_p (*this, other))
+  return false;
+  return true;
+}
+
bool
vector_insn_info::compatible_avl_p (const vl_vtype_info ) const
{
@@ -2190,7 +2248,7 @@ vector_insn_info::fuse_mask_policy (const 
vector_insn_info ,
vector_insn_info
vector_insn_info::merge (const vector_insn_info &merge_info,
- enum merge_type type = LOCAL_MERGE) const
+ enum merge_type type) const
{
   if (!vsetvl_insn_p (get_insn ()->rtl ()))
 gcc_assert (this->compatible_p (merge_info)
@@ -2696,7 +2754,7 @@ pass_vsetvl::compute_local_backward_infos (const bb_info 
*bb)
&& !reg_available_p (insn, change))
  && change.compatible_p (info))
{
-   info = change.merge (info);
+   info = change.merge (info, LOCAL_MERGE);
  /* Fix PR109399, we should update user vsetvl instruction
 if there is a change in demand fusion.  */
  if (vsetvl_insn_p (insn->rtl ()))
@@ -3925,6 +3983,15 @@ pass_vsetvl::cleanup_insns (void) const
   for (insn_info *insn : bb->real_nondebug_insns ())
{
  rtx_insn *rinsn = insn->rtl ();
+   const auto &dem = m_vector_manager->vector_insn_infos[insn->uid ()];
+   /* Eliminate local vsetvl:
+bb 0:
+vsetvl a5,a6,...
+vsetvl zero,a5.
+
+  Eliminate vsetvl in bb2 when a5 is only coming from
+  bb 0.  */
+   local_eliminate_vsetvl_insn (dem);
  if (vlmax_avl_insn_p (rinsn))
{
diff --git a/gcc/config/riscv/riscv-vsetvl.h b/gcc/config/riscv/riscv-vsetvl.h
index 237381f7026b..4fe08cfc789d 100644
--- a/gcc/config/riscv/riscv-vsetvl.h
+++ b/gcc/config/riscv/riscv-vsetvl.h
@@ -380,6 +380,7 @@ public:
   void 
