Re: [PATCH] RISC-V: Support combine cond extend and reduce sum to cond widen reduce sum

2023-09-18 Thread Robin Dapp via Gcc-patches
Hi Lehua,

> +(define_expand "vcond_mask_"
> +  [(set (match_operand:V_VLS 0 "register_operand")
> +(if_then_else:V_VLS
> +  (match_operand: 3 "register_operand")
> +  (match_operand:V_VLS 1 "nonmemory_operand")
> +  (match_operand:V_VLS 2 "vector_register_or_const_0_operand")))]
> +  "TARGET_VECTOR"

Would it hurt to allow any nonmemory operand here and just force the
"unsupported" constants into a register?

> +  {
> +if (satisfies_constraint_Wc0 (operands[2]))
> +  {
> +rtx reg = gen_reg_rtx (mode);
> +emit_insn (gen_vec_duplicate_const_0 (reg, operands[2]));

Can't we emit a move_insn directly without going through the new pattern?
Or will that be optimized away in between?  And the new pattern isn't
actually a duplicate but a move anyway so maybe a force_reg (operands[2]) would
already do?  Probably initial values other than 0 don't work out of the box?

In any case it wouldn't hurt to describe the "design decisions" (i.e. why
we need one thing and not another) so it's easier to follow the patterns
in the future.

Regards
 Robin


Re: [PATCH] RISC-V: Remove redundant vec_duplicate pattern

2023-09-18 Thread Robin Dapp via Gcc-patches
LGTM.

Regards
 Robin



Re: [PATCH] gimple-match: Do not try UNCOND optimization with COND_LEN.

2023-09-18 Thread Robin Dapp via Gcc-patches
Ping.

Regards
 Robin


Re: [PATCH v1] RISC-V: Bugfix for scalar move with merged operand

2023-09-18 Thread Robin Dapp via Gcc-patches
> I must be missing something.  Doesn't insn 10 broadcast the immediate
> 0x2 to both elements of r142?!?  What am I missing?
It is indeed a bit misleading.  The difference is in the mask which
is not displayed in the short form.  So we actually use a vec_dup
for a single-element move, essentially a masked vec_dup where only
one element is masked in.

The problem was that the original doesn't use a merging "vec_set"
but a "destructive" one where the other elements get ignored.
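A tiny standalone model may make the distinction concrete. This is an illustrative sketch in plain C++, not GCC internals; `masked_vec_dup_merge` is a hypothetical name for the merging semantics described above:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative model (plain C++, not GCC internals): a masked vec_dup with
// exactly one active lane merges the scalar into that lane and keeps every
// inactive lane from the old destination, i.e. it acts like a merging
// vec_set.  A "destructive" vec_set would clobber the inactive lanes.
std::vector<int> masked_vec_dup_merge (std::vector<int> dest, int scalar,
                                       const std::vector<bool> &mask)
{
  for (std::size_t i = 0; i < dest.size (); i++)
    if (mask[i])
      dest[i] = scalar;
  return dest;
}
```

With exactly one lane active this writes the scalar into that lane and preserves the rest, which is the behavior a merging "vec_set" relies on.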

The fix is OK IMHO. 

Regards
 Robin


Re: [PATCH V4] RISC-V: Expand VLS mode to scalar mode move[PR111391]

2023-09-15 Thread Robin Dapp via Gcc-patches
> You mean this patch is ok?

I thought about it a bit more.  From my point of view the patch is OK
for now in order to get the bug out of the way.

In the longer term I would really prefer a more "regular" solution
(i.e. via hard_regno_mode_ok) and related.  I can take care of that
once I have a bit of time but for now let's go ahead.

Regards
 Robin


Re: Machine Mode ICE in RISC-V when LTO

2023-09-15 Thread Robin Dapp via Gcc-patches
Hi Thomas, Jakub,

is there anything we can do to assist from the riscv side in order to help
with this?  I haven't really been involved with it but was wondering
what's missing.  If I understand correctly, Thomas has a major cleanup
planned but might not get to it soon.  His proposed fix helps the
riscv case even without that rework, though?

If so, I'd kindly ping Jakub to check if the fix is reasonable.

Thank you.

Regards
 Robin


Re: [PATCH V4] RISC-V: Expand VLS mode to scalar mode move[PR111391]

2023-09-14 Thread Robin Dapp via Gcc-patches
> I am thinking what we are doing is something like we are allowing
> scalar mode within the vector register, so...not sure should we try to
> implement that within the mov pattern?
> 
> I guess we need some inputs from Jeff.

Sorry for the late response.  I have also been thinking about this and
it feels a bit like a bandaid to me.  Usually register-class moves like
this are performed by reload (which consults register_move_costs among
other things) and we are working around it.

The situation is that we move a vec_duplicate of QImodes into a vector
register.  Then we want to use this as a scalar call argument so we need
to transfer it back to a DImode register.

One maybe more typical solution would be to allow small VLS vector modes
like V8QI in GPRs (via hard_regno_mode_ok) until reload so we could have
a (set (reg:V8QI a0) (vec_duplicate:V8QI ...)).

The next step would be to have a mov expander with target "r"
constraint (and source "vr") that performs the actual move.  This is
where Juzhe's mov code could fit in (without the subreg handling).
If I'm not mistaken vmv.x.s without slidedown should be sufficient for
our case as we'd only want to use the whole thing when the full vector
fits into a GPR. 

All that's missing is a (reinterpreting) vtype change to Pmode-sized
elements before. I quickly hacked something together (without the proper
mode change) and the resulting code looks like:

vsetvli zero, 8, e8, ...
vmv.v.x v1,a5
# missing vsetivli zero, 1, e64, ... or something 
vmv.x.s a0,v1

Now, whether that's efficient (and desirable) is a separate issue and
should probably be defined by register_move_costs as well as instruction
costs.  I wasn't actually aware of this call/argument optimization that
uses vec_duplicate and I haven't checked what costing (if at all) it
uses.

Regards
 Robin


Re: [PATCH] RISC-V: Support VLS modes VEC_EXTRACT auto-vectorization

2023-09-13 Thread Robin Dapp via Gcc-patches
> Yes. We need the additional helper function since I will call emit_insn 
> (gen_vec_extract (mode, mode)
> in the following patch which fixes PR111391 ICE.

OK.

Regards
 Robin



Re: [PATCH] RISC-V: Support VLS modes VEC_EXTRACT auto-vectorization

2023-09-13 Thread Robin Dapp via Gcc-patches
> -(define_expand "vec_extract"
> +(define_expand "@vec_extract"

Do we need the additional helper function?  If not let's rather not
add them for build-time reasons.  The rest is OK, no need for v2.

Regards
 Robin


Re: [PATCH V6] RISC-V: Enable vec_int testsuite for RVV VLA vectorization

2023-09-12 Thread Robin Dapp via Gcc-patches
> Most (all?) of those are due to:
> f951: Warning: command-line option '-Wno-psabi' is valid for 
> C/C++/D/LTO/ObjC/ObjC++ but not for Fortran
> so no real bug.

When pushing this, I'd take the liberty of enabling the recently merged vector
ABI so we don't require -Wno-psabi anymore.  All Fortran FAILs disappear and
nothing else changes.

--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -11166,12 +11166,12 @@ proc check_vect_support_and_set_flags { } {
 } elseif [istarget riscv64-*-*] {
if [check_effective_target_riscv_vector_hw] {
lappend DEFAULT_VECTCFLAGS "--param" 
"riscv-autovec-preference=scalable"
-   lappend DEFAULT_VECTCFLAGS "-Wno-psabi"
+   lappend DEFAULT_VECTCFLAGS "--param" "riscv-vector-abi"
set dg-do-what-default run
} else {
lappend DEFAULT_VECTCFLAGS "-march=rv64gcv_zvfh" "-mabi=lp64d"
lappend DEFAULT_VECTCFLAGS "--param" 
"riscv-autovec-preference=scalable"
-   lappend DEFAULT_VECTCFLAGS "-Wno-psabi"
+   lappend DEFAULT_VECTCFLAGS "--param" "riscv-vector-abi"
set dg-do-what-default compile
}
 } else {

Regards
 Robin



Re: [PATCH V6] RISC-V: Enable vec_int testsuite for RVV VLA vectorization

2023-09-12 Thread Robin Dapp via Gcc-patches
The current status (for rv64gcv) is:

=== gcc tests ===

Running target unix/-march=rv64gcv
XPASS: gcc.dg/vect/bb-slp-subgroups-3.c -flto -ffat-lto-objects  
scan-tree-dump-times slp2 "optimized: basic block" 2
XPASS: gcc.dg/vect/bb-slp-subgroups-3.c scan-tree-dump-times slp2 "optimized: 
basic block" 2
XPASS: gcc.dg/vect/no-scevccp-outer-16.c scan-tree-dump-times vect "OUTER LOOP 
VECTORIZED." 1
XPASS: gcc.dg/vect/no-scevccp-outer-17.c scan-tree-dump-times vect "OUTER LOOP 
VECTORIZED." 1
XPASS: gcc.dg/vect/no-scevccp-outer-19.c scan-tree-dump-times vect "OUTER LOOP 
VECTORIZED." 1
XPASS: gcc.dg/vect/no-scevccp-outer-21.c scan-tree-dump-times vect "OUTER LOOP 
VECTORIZED." 1
FAIL: gcc.dg/vect/no-scevccp-outer-7.c scan-tree-dump-times vect 
"vect_recog_widen_mult_pattern: detected" 1
FAIL: gcc.dg/vect/no-scevccp-vect-iv-3.c scan-tree-dump-times vect 
"vect_recog_widen_sum_pattern: detected" 1
FAIL: gcc.dg/vect/pr57705.c -flto -ffat-lto-objects  scan-tree-dump-times vect 
"vectorized 1 loop" 2
FAIL: gcc.dg/vect/pr57705.c scan-tree-dump-times vect "vectorized 1 loop" 2
FAIL: gcc.dg/vect/pr65518.c -flto -ffat-lto-objects  scan-tree-dump-times vect 
"vectorized 0 loops in function" 2
FAIL: gcc.dg/vect/pr65518.c scan-tree-dump-times vect "vectorized 0 loops in 
function" 2
FAIL: gcc.dg/vect/slp-1.c -flto -ffat-lto-objects  scan-tree-dump-times vect 
"vectorizing stmts using SLP" 4
FAIL: gcc.dg/vect/slp-1.c scan-tree-dump-times vect "vectorizing stmts using 
SLP" 4
FAIL: gcc.dg/vect/slp-12a.c -flto -ffat-lto-objects  scan-tree-dump-times vect 
"vectorizing stmts using SLP" 1
FAIL: gcc.dg/vect/slp-12a.c scan-tree-dump-times vect "vectorizing stmts using 
SLP" 1
FAIL: gcc.dg/vect/slp-16.c -flto -ffat-lto-objects  scan-tree-dump-times vect 
"vectorizing stmts using SLP" 2
FAIL: gcc.dg/vect/slp-16.c scan-tree-dump-times vect "vectorizing stmts using 
SLP" 2
FAIL: gcc.dg/vect/slp-34-big-array.c -flto -ffat-lto-objects  
scan-tree-dump-times vect "vectorizing stmts using SLP" 2
FAIL: gcc.dg/vect/slp-34-big-array.c scan-tree-dump-times vect "vectorizing 
stmts using SLP" 2
FAIL: gcc.dg/vect/slp-34.c -flto -ffat-lto-objects  scan-tree-dump-times vect 
"vectorizing stmts using SLP" 2
FAIL: gcc.dg/vect/slp-34.c scan-tree-dump-times vect "vectorizing stmts using 
SLP" 2
FAIL: gcc.dg/vect/slp-35.c -flto -ffat-lto-objects  scan-tree-dump-times vect 
"vectorizing stmts using SLP" 1
FAIL: gcc.dg/vect/slp-35.c scan-tree-dump-times vect "vectorizing stmts using 
SLP" 1
XPASS: gcc.dg/vect/slp-reduc-3.c -flto -ffat-lto-objects  scan-tree-dump-times 
vect "vectorizing stmts using SLP" 1
XPASS: gcc.dg/vect/slp-reduc-3.c scan-tree-dump-times vect "vectorizing stmts 
using SLP" 1
FAIL: gcc.dg/vect/slp-reduc-4.c -flto -ffat-lto-objects  scan-tree-dump vect 
"vectorizing stmts using SLP"
FAIL: gcc.dg/vect/slp-reduc-4.c scan-tree-dump vect "vectorizing stmts using 
SLP"
FAIL: gcc.dg/vect/slp-reduc-7.c -flto -ffat-lto-objects execution test
FAIL: gcc.dg/vect/slp-reduc-7.c execution test
XPASS: gcc.dg/vect/vect-24.c -flto -ffat-lto-objects  scan-tree-dump-times vect 
"vectorized 3 loops" 1
XPASS: gcc.dg/vect/vect-24.c scan-tree-dump-times vect "vectorized 3 loops" 1
FAIL: gcc.dg/vect/vect-alias-check-4.c  (test for warnings, line 34)
FAIL: gcc.dg/vect/vect-alias-check-4.c  at line 19 (test for warnings, line 17)
FAIL: gcc.dg/vect/vect-alias-check-4.c  at line 27 (test for warnings, line 25)
FAIL: gcc.dg/vect/vect-alias-check-4.c (test for excess errors)
FAIL: gcc.dg/vect/vect-alias-check-4.c -flto -ffat-lto-objects  (test for 
warnings, line 34)
FAIL: gcc.dg/vect/vect-alias-check-4.c -flto -ffat-lto-objects  at line 19 
(test for warnings, line 17)
FAIL: gcc.dg/vect/vect-alias-check-4.c -flto -ffat-lto-objects  at line 27 
(test for warnings, line 25)
FAIL: gcc.dg/vect/vect-alias-check-4.c -flto -ffat-lto-objects (test for excess 
errors)
FAIL: gcc.dg/vect/vect-bic-bitmask-12.c -flto -ffat-lto-objects  scan-tree-dump 
dce7 "<=s*.+{ 255,.+}"
FAIL: gcc.dg/vect/vect-bic-bitmask-12.c scan-tree-dump dce7 "<=s*.+{ 
255,.+}"
FAIL: gcc.dg/vect/vect-bic-bitmask-23.c -flto -ffat-lto-objects  scan-tree-dump 
dce7 "<=s*.+{ 255, 15, 1, 65535 }"
FAIL: gcc.dg/vect/vect-bic-bitmask-23.c scan-tree-dump dce7 "<=s*.+{ 255, 
15, 1, 65535 }"
FAIL: gcc.dg/vect/vect-multitypes-11.c -flto -ffat-lto-objects  
scan-tree-dump-times vect "vectorized 1 loops" 1
FAIL: gcc.dg/vect/vect-multitypes-11.c scan-tree-dump-times vect "vectorized 1 
loops" 1

All of these are well understood.  For slp-reduc-7.c there is already a fix
posted and we will be needing a vsetvl pass fix after that.

Therefore, I'm going to push this to the trunk.

Note there are also a number of fortran vect failures that we haven't
looked at yet:

=== gfortran tests ===

Running target unix/-march=rv64gcv
FAIL: gfortran.dg/vect/O3-bb-slp-1.f   -O  (test for excess errors)
FAIL: gfortran.dg/vect/O3-bb-slp-2.f   -O  (test for excess errors)
FAIL: 

Re: [PATCH V2] RISC-V: Support VECTOR BOOL vcond_mask optab[PR111337]

2023-09-12 Thread Robin Dapp via Gcc-patches
The PR thing needs to be moved but I can commit it.

Regards
 Robin



Re: [PATCH] RISC-V: Support VECTOR BOOL vcond_mask optab[PR111337]

2023-09-12 Thread Robin Dapp via Gcc-patches
Maybe you want to add PR target/111337 to the changelog?

The rest LGTM.

Regards
 Robin


Re: [PATCH V5] RISC-V: Support Dynamic LMUL Cost model

2023-09-12 Thread Robin Dapp via Gcc-patches
LGTM.  We should just keep in mind the restrictions discussed in the
other thread.

Regards
 Robin


Re: [PATCH V4] RISC-V: Support Dynamic LMUL Cost model

2023-09-12 Thread Robin Dapp via Gcc-patches


> This is first version of dynamic LMUL.
> I didn't test it with full GCC testsuite.
> 
> My plan is to first pass all GCC testsuite (including vect.exp) with default 
> LMUL = M1.
> Then enable dynamic LMUL to test it.
> 
> Maybe we could tolerate this ICE issue for now. Then we can test it
> with full GCC testsuite (I believe we can reproduce it with some case in
> GCC testsuite in the future).
> 
> Is that reasonable ? If yes, I will fix all your comments and send V5.

Yes, works for me.

Regards
 Robin



Re: [PATCH V4] RISC-V: Support Dynamic LMUL Cost model

2023-09-12 Thread Robin Dapp via Gcc-patches
> Is calculix big ?

It's 7 nested for loops IIRC and, when unrolling, can get pretty nasty.
I tested with -Ofast -funroll-loops.  I think wrf is even larger, maybe I
can run a full comparison test tonight to have good coverage.

> Could you give me the testcase to reproduce it?

OK, I will try to reduce it, will be Fortran, though.

Regards
 Robin


Re: [PATCH V4] RISC-V: Support Dynamic LMUL Cost model

2023-09-12 Thread Robin Dapp via Gcc-patches
I did some benchmarks and, at least for calculix, the differences are
minuscule.  I'd say we can stick with the current approach and improve
as needed.

However, I noticed ICEs here:

+  gcc_assert (biggest_size >= mode_size);

and here:

+  mode = TYPE_MODE (TREE_TYPE (lhs));

when compiling calculix.

Regards
 Robin


Re: [PATCH V4] RISC-V: Support Dynamic LMUL Cost model

2023-09-12 Thread Robin Dapp via Gcc-patches
Hi Juzhe,

> +max_number_of_live_regs (const basic_block bb,
> +  const hash_map<tree, pair> &live_ranges,
> +  unsigned int max_point, machine_mode biggest_mode,
> +  int lmul)
> +{
> +  unsigned int max_nregs = 0;
> +  unsigned int i;
> +  unsigned int live_point = 0;
> +  auto_vec<unsigned int> live_vars_vec;
> +  live_vars_vec.safe_grow (max_point + 1, true);
> +  for (i = 0; i < live_vars_vec.length (); ++i)
> +live_vars_vec[i] = 0;
> +  for (hash_map<tree, pair>::iterator iter = live_ranges.begin ();
> +   iter != live_ranges.end (); ++iter)
> +{
> +  tree var = (*iter).first;
> +  pair live_range = (*iter).second;
> +  for (i = live_range.first; i <= live_range.second; i++)
> + {
> +   machine_mode mode = TYPE_MODE (TREE_TYPE (var));
> +   unsigned int nregs
> + = compute_nregs_for_mode (mode, biggest_mode, lmul);
> +   live_vars_vec[i] += nregs;
> +   if (live_vars_vec[i] > max_nregs)
> + max_nregs = live_vars_vec[i];
> + }
> +}

My concern is that we have O(nm) here, where n = number of live_ranges
and m = size of live range.  In large basic blocks (think calculix of
SPECfp 2006 which can reach up to 2000 instructions IIRC) this might
become prohibitive.

I'm going to do a quick benchmark with calculix and report back.  If
there is no noticeable difference we can ditch my idea.

For short live ranges (like < 10) the O(nm) could be better.  As of now,
we still calculate the nregs n*m times, though.  I have something like
the following in mind (it is definitely not shorter, though):

  struct range {
  unsigned int pt;
  bool start;
  unsigned int nregs;
  };

  auto_vec<range> ranges (2 * live_ranges.elements ());
  for (hash_map<tree, pair>::iterator iter = live_ranges.begin ();
   iter != live_ranges.end (); ++iter)
{
  tree var = (*iter).first;
  machine_mode mode = TYPE_MODE (TREE_TYPE (var));
  unsigned int nregs
  = compute_nregs_for_mode (mode, biggest_mode, lmul);
  ranges.quick_push ({(*iter).second.first, true, nregs});
  ranges.quick_push ({(*iter).second.second, false, nregs});
}

  ranges.qsort ([] (const void *a, const void *b) -> int {
unsigned int aa = ((const range *)a)->pt;
unsigned int bb = ((const range *)b)->pt;
if (aa < bb)
  return -1;
if (aa == bb)
  return 0;
return 1;
});

  unsigned int cur = 0;
  unsigned int max_nregs = ranges[0].nregs;

  for (auto r : ranges)
{
  if (r.start)
cur += r.nregs;
  else
cur -= r.nregs;
  max_nregs = MAX (max_nregs, cur);
}

> +  for (i = 0; i < cfun->gimple_df->ssa_names->length (); i++)
> +{
> +  tree t = ssa_name (i);
> +  if (!t)
> +   continue;

Could likely be replaced by

  tree t;
  FOR_EACH_SSA_NAME (i, t, cfun)

> +static void
> +update_local_live_ranges (
> +  vec_info *vinfo,
> +  hash_map> _points_per_bb,
> +  hash_map> _ranges_per_bb)
> +{

I just realized (sorry) that this is "nested" a bit far.  Can we still
have e.g. 

> +  if (loop_vec_info loop_vinfo = dyn_cast (vinfo))
> +{

this,

> +   if (STMT_VINFO_TYPE (vect_stmt_to_vectorize (stmt_info))
> +   != undef_vec_info_type)

this,

> +   if (live_range)
> + {

and this just "continue"?

Apart from that, LGTM.

Regards
 Robin



Re: [PATCH] gimple-match: Do not try UNCOND optimization with COND_LEN.

2023-09-11 Thread Robin Dapp via Gcc-patches
Hi,

as Juzhe noticed in gcc.dg/pr92301.c there was still something missing in
the last patch.  The attached v2 makes sure we always have a COND_LEN operation
before returning true and initializes len and bias even if they are unused.

Bootstrapped and regtested on aarch64 and x86.

Regards
 Robin

Subject: [PATCH v2] gimple-match: Do not try UNCOND optimization with
 COND_LEN.

On riscv we mis-optimize conditional (length) operations into
unconditional operations e.g. in slp-reduc-7.c and
gcc.dg/pr92301.c.

This patch prevents optimizing e.g.
 COND_LEN_ADD ({-1, ... }, a, 0, c, len, bias)
unconditionally into just "a".

Currently, we assume that COND_LEN operations can be optimized similarly
to COND operations.  As the length is part of the mask (and usually not
compile-time constant), we must not perform any optimization that relies
on just the mask being "true".  This patch ensures that we still have a
COND_LEN pattern after optimization.
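For illustration, the length semantics can be modeled in a few lines of plain C++. This is an assumption-laden sketch, not GCC code (the bias operand is folded into `len` here, and the tail policy is simplified to "take the else value"); it shows why even an all-ones mask does not allow folding to plain `a`:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative model (not GCC code) of a COND_LEN_ADD-style operation,
// with the bias already folded into `len`:
//   res[i] = (i < len && mask[i]) ? a[i] + b[i] : else_val[i]
// Even with an all-ones mask and b == 0 the result is NOT simply `a`,
// because every element at index >= len must come from else_val.
std::vector<int> cond_len_add (const std::vector<bool> &mask,
                               const std::vector<int> &a,
                               const std::vector<int> &b,
                               const std::vector<int> &else_val,
                               std::size_t len)
{
  std::vector<int> res (a.size ());
  for (std::size_t i = 0; i < a.size (); i++)
    res[i] = (i < len && mask[i]) ? a[i] + b[i] : else_val[i];
  return res;
}
```

With `mask = {-1, ...}`, `b = 0` and `len = 2` on four elements, the last two results still come from the else value, so the whole operation cannot be replaced by `a`.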

gcc/ChangeLog:

PR target/111311
* gimple-match-exports.cc (maybe_resimplify_conditional_op):
Check for length masking.
(try_conditional_simplification): Check that the result is still
length masked.
---
 gcc/gimple-match-exports.cc | 38 ++---
 gcc/gimple-match.h  |  3 ++-
 2 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/gcc/gimple-match-exports.cc b/gcc/gimple-match-exports.cc
index b36027b0bad..d41de98a3d3 100644
--- a/gcc/gimple-match-exports.cc
+++ b/gcc/gimple-match-exports.cc
@@ -262,7 +262,8 @@ maybe_resimplify_conditional_op (gimple_seq *seq, 
gimple_match_op *res_op,
   if (!res_op->cond.cond)
 return false;
 
-  if (!res_op->cond.else_value
+  if (!res_op->cond.len
+  && !res_op->cond.else_value
   && res_op->code.is_tree_code ())
 {
   /* The "else" value doesn't matter.  If the "then" value is a
@@ -301,9 +302,12 @@ maybe_resimplify_conditional_op (gimple_seq *seq, 
gimple_match_op *res_op,
 
   /* If the "then" value is a gimple value and the "else" value matters,
  create a VEC_COND_EXPR between them, then see if it can be further
- simplified.  */
+ simplified.
+ Don't do this if we have a COND_LEN_ as that would make us lose the
+ length masking.  */
   gimple_match_op new_op;
-  if (res_op->cond.else_value
+  if (!res_op->cond.len
+  && res_op->cond.else_value
   && VECTOR_TYPE_P (res_op->type)
   && gimple_simplified_result_is_gimple_val (res_op))
 {
@@ -314,7 +318,7 @@ maybe_resimplify_conditional_op (gimple_seq *seq, 
gimple_match_op *res_op,
   return gimple_resimplify3 (seq, res_op, valueize);
 }
 
-  /* Otherwise try rewriting the operation as an IFN_COND_* call.
+  /* Otherwise try rewriting the operation as an IFN_COND_(LEN_)* call.
  Again, this isn't a simplification in itself, since it's what
  RES_OP already described.  */
   if (convert_conditional_op (res_op, &new_op))
@@ -386,9 +390,29 @@ try_conditional_simplification (internal_fn ifn, 
gimple_match_op *res_op,
 default:
   gcc_unreachable ();
 }
-  *res_op = cond_op;
-  maybe_resimplify_conditional_op (seq, res_op, valueize);
-  return true;
+
+  if (len)
+{
+  /* If we had a COND_LEN before we need to ensure that it stays that
+way.  */
+  gimple_match_op old_op = *res_op;
+  *res_op = cond_op;
+  maybe_resimplify_conditional_op (seq, res_op, valueize);
+
+  auto cfn = combined_fn (res_op->code);
+  if (internal_fn_p (cfn)
+ && internal_fn_len_index (as_internal_fn (cfn)) != -1)
+   return true;
+
+  *res_op = old_op;
+  return false;
+}
+  else
+{
+  *res_op = cond_op;
+  maybe_resimplify_conditional_op (seq, res_op, valueize);
+  return true;
+}
 }
 
 /* Helper for the autogenerated code, valueize OP.  */
diff --git a/gcc/gimple-match.h b/gcc/gimple-match.h
index bec3ff42e3e..d192b7dae3e 100644
--- a/gcc/gimple-match.h
+++ b/gcc/gimple-match.h
@@ -56,7 +56,8 @@ public:
 
 inline
 gimple_match_cond::gimple_match_cond (tree cond_in, tree else_value_in)
-  : cond (cond_in), else_value (else_value_in)
+  : cond (cond_in), else_value (else_value_in), len (NULL_TREE),
+bias (NULL_TREE)
 {
 }
 
-- 
2.41.0




Re: [PATCH V3] RISC-V: Support Dynamic LMUL Cost model

2023-09-11 Thread Robin Dapp via Gcc-patches
Hi Juzhe,

glad that we can use the dominator info directly.  Could we move the
calculation of the info to the beginning (if it's not available)?  That
makes it clearer that it's a prerequisite.  Function comments look
good now.

Some general remarks kind of similar to v1:

 - I would prefer a hash_map or similar to hold the end point for a range
   instead of looking through potentially all ranges in contrived cases.

 - As long as we're just looking for the maximum number of live registers,
   we can use a sliding-window approach:  create a structure with all
   start and end points, sort it, and increase the current pressure
   if we start a new range or decrease.  That's O(n log n).
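A minimal standalone sketch of that sliding-window idea (illustrative names and types, not the actual GCC data structures):

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Sliding-window sketch: emit one +nregs event per range start and one
// -nregs event just past each range end, sort all events, and walk them
// once while tracking the current pressure.  O(n log n) in the number of
// ranges instead of O(n * m) in ranges times range length.
struct live_range { unsigned start, end, nregs; };

unsigned
max_live_regs (const std::vector<live_range> &ranges)
{
  std::vector<std::pair<unsigned, int>> events;
  for (const live_range &r : ranges)
    {
      events.emplace_back (r.start, (int) r.nregs);
      events.emplace_back (r.end + 1, -(int) r.nregs);
    }
  std::sort (events.begin (), events.end ());
  int cur = 0, max_regs = 0;
  for (const std::pair<unsigned, int> &e : events)
    {
      cur += e.second;
      max_regs = std::max (max_regs, cur);
    }
  return (unsigned) max_regs;
}
```

Ends become events at `end + 1` so inclusive ranges that merely touch are not counted as overlapping; pair ordering processes the decrement before an increment at the same point.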

> +  const ssa_use_operand_t *const head = &(SSA_NAME_IMM_USE_NODE (t));
> +  const ssa_use_operand_t *ptr;
> +
> +  for (ptr = head->next; ptr != head; ptr = ptr->next)
> + {

Why does FOR_EACH_IMM_USE not work here?

> +   unsigned int max_point
> + = (*program_points_per_bb.get (e->src)).length () - 1;
> +   for (k = 0; k < (*live_ranges).length (); k++)
> + {
> +   if ((*live_ranges)[i].var == def)

Would also be nice not having to search through all ranges but just index/hash
it via var (or similar).

What about one test with global live ranges?  Not a necessity IMHO we can still
add it later.

Regards
 Robin



[PATCH] match: Don't sink comparisons into vec_cond operands.

2023-09-08 Thread Robin Dapp via Gcc-patches
Hi,

on riscv gcc.dg/pr70252.c ICEs at gimple-isel.cc:283.  This is because
we created the gimple statement

  mask__7.36_170 = VEC_COND_EXPR ;

during vrp2.

What happens is that, starting with
  maskdest = (vec_cond mask1 1 0) >= (vec_cond mask2 1 0)
we fold to
  maskdest = mask1 >= (vec_cond (mask2 1 0))
and then sink the "mask1 >=" into the vec_cond so we end up with
  maskdest = vec_cond (mask2 ? mask1 : 0),
i.e. a vec_cond with a mask "data mode".

In gimple-isel, when the target does not provide a vcond_mask
implementation for that (which none does) we fail the assertion that the
mask mode be MODE_VECTOR_INT.

To prevent this, this patch restricts the match.pd sinking pattern to
non-mask types.  I was also thinking about restricting the type of
the operands, wondering if that would be less intrusive.

Bootstrapped and regression-tested on x86 and aarch64.

Regards
 Robin

gcc/ChangeLog:

PR target/111337
* match.pd: Do not sink comparisons into vec_conds when the type
is a vector mask.
---
 gcc/match.pd | 24 +++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/gcc/match.pd b/gcc/match.pd
index 8c24dae71cd..db3e698f471 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -4856,7 +4856,7 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
   (vec_cond @0 (view_convert! @1) (view_convert! @2
 
 /* Sink binary operation to branches, but only if we can fold it.  */
-(for op (tcc_comparison plus minus mult bit_and bit_ior bit_xor
+(for op (plus minus mult bit_and bit_ior bit_xor
 lshift rshift rdiv trunc_div ceil_div floor_div round_div
 trunc_mod ceil_mod floor_mod round_mod min max)
 /* (c ? a : b) op (c ? d : e)  -->  c ? (a op d) : (b op e) */
@@ -4872,6 +4872,28 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
   (op @3 (vec_cond:s @0 @1 @2))
   (vec_cond @0 (op! @3 @1) (op! @3 @2
 
+/* Comparison sinks might be folded into vector masks which could
+   end up as "data" operand of a vec_cond
+   e.g. (vec_cond @0 (mask1) (...)).
+   gimple-isel does not handle such cases if the target does not provide
+   a vcond_mask.  Therefore, restrict the operands to non-mask classes.  */
+(for op (tcc_comparison)
+/* (c ? a : b) op (c ? d : e)  -->  c ? (a op d) : (b op e) */
+ (simplify
+  (op (vec_cond:s @0 @1 @2) (vec_cond:s @0 @3 @4))
+  (if (GET_MODE_CLASS (TYPE_MODE (type)) != MODE_VECTOR_BOOL)
+(vec_cond @0 (op! @1 @3) (op! @2 @4
+
+/* (c ? a : b) op d  -->  c ? (a op d) : (b op d) */
+ (simplify
+  (op (vec_cond:s @0 @1 @2) @3)
+  (if (GET_MODE_CLASS (TYPE_MODE (type)) != MODE_VECTOR_BOOL)
+(vec_cond @0 (op! @1 @3) (op! @2 @3
+ (simplify
+  (op @3 (vec_cond:s @0 @1 @2))
+  (if (GET_MODE_CLASS (TYPE_MODE (type)) != MODE_VECTOR_BOOL)
+(vec_cond @0 (op! @3 @1) (op! @3 @2)
+
 #if GIMPLE
 (match (nop_atomic_bit_test_and_p @0 @1 @4)
  (bit_and (convert?@4 (ATOMIC_FETCH_OR_XOR_N @2 INTEGER_CST@0 @3))
-- 
2.41.0



[PATCH] gimple-match: Do not try UNCOND optimization with COND_LEN.

2023-09-08 Thread Robin Dapp via Gcc-patches
Hi,

found in slp-reduc-7.c, this patch prevents optimizing e.g.
 COND_LEN_ADD ({-1, ... }, a, 0, c, len, bias)
unconditionally into just "a".

Currently, we assume that COND_LEN operations can be optimized similarly
to COND operations.  As the length is part of the mask (and usually not
compile-time constant), we must not perform any optimization that relies
on just the mask being "true".

Bootstrap and testsuite are unchanged on aarch64 and x86.

Regards
 Robin

gcc/ChangeLog:

* gimple-match-exports.cc (maybe_resimplify_conditional_op):
Check for length masking.
---
 gcc/gimple-match-exports.cc | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/gcc/gimple-match-exports.cc b/gcc/gimple-match-exports.cc
index b36027b0bad..73be9f4f4c3 100644
--- a/gcc/gimple-match-exports.cc
+++ b/gcc/gimple-match-exports.cc
@@ -262,7 +262,8 @@ maybe_resimplify_conditional_op (gimple_seq *seq, 
gimple_match_op *res_op,
   if (!res_op->cond.cond)
 return false;
 
-  if (!res_op->cond.else_value
+  if (!res_op->cond.len
+  && !res_op->cond.else_value
   && res_op->code.is_tree_code ())
 {
   /* The "else" value doesn't matter.  If the "then" value is a
@@ -301,9 +302,12 @@ maybe_resimplify_conditional_op (gimple_seq *seq, 
gimple_match_op *res_op,
 
   /* If the "then" value is a gimple value and the "else" value matters,
  create a VEC_COND_EXPR between them, then see if it can be further
- simplified.  */
+ simplified.
+ Don't do this if we have a COND_LEN_ as that would make us lose the
+ length masking.  */
   gimple_match_op new_op;
-  if (res_op->cond.else_value
+  if (!res_op->cond.len
+  && res_op->cond.else_value
   && VECTOR_TYPE_P (res_op->type)
   && gimple_simplified_result_is_gimple_val (res_op))
 {
@@ -314,7 +318,7 @@ maybe_resimplify_conditional_op (gimple_seq *seq, 
gimple_match_op *res_op,
   return gimple_resimplify3 (seq, res_op, valueize);
 }
 
-  /* Otherwise try rewriting the operation as an IFN_COND_* call.
+  /* Otherwise try rewriting the operation as an IFN_COND_(LEN_)* call.
  Again, this isn't a simplification in itself, since it's what
  RES_OP already described.  */
   if (convert_conditional_op (res_op, &new_op))
-- 
2.41.0



Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.

2023-09-07 Thread Robin Dapp via Gcc-patches
Thanks for looking at it in detail.

> Yeah, I think this is potentially a blocker for propagating A into B
> when A is used elsewhere.  Combine is able to combine A and B while
> keeping A in parallel with the result.  I think either fwprop would
> need to try that too, or it would need to be restricted to cases where A
> is only used in B.

That seems a rather severe limitation and my original use case would
not get optimized considerably anymore.  The intention was to replace
all uses (if register pressure allows).  Of course the example is simple
enough that a propagation is always useful if the costs allow it, so
it might not be representative.

I'm wondering if we could (my original misunderstanding) tentatively
try to propagate into all uses of a definition and, when reaching
a certain ratio, decide that it might be worth it, otherwise revert.
Would be very crude though, and not driven by the actual problem we're
trying to avoid. 

> I think the summary is:
> 
> IMO, we have to be mindful that combine is still to run.  We need to
> avoid making equal-cost changes if the new form is more complex, or
> otherwise likely to interfere with combine.

I guess we don't have a good measure for complexity or "combinability"
and even lower-cost changes could result in worse options later.
Would it make sense to have a strict less-than cost policy for those
more complex propagations?  Or do you consider the approach in its
current shape "hopeless", given the complications we discussed?

> Alternatively, we could delay the optimisation until after combine
> and have freer rein, since we're then just mopping up opportunities
> that other passes left behind.
> 
> A while back I was experimenting with a second combine pass.  That was
> the original motiviation for rtl-ssa.  I never got chance to finish it
> off though.

This doesn't sound like something that would still materialize before
the end of stage 1 :)
Do you see any way of restricting the current approach to make it less
intrusive and still worthwhile?  Limiting to vec_duplicate might be
much too arbitrary but would still help for my original example.

Regards
 Robin



Re: [PATCH] RISC-V: Add VLS mask modes mov patterns[PR111311]

2023-09-07 Thread Robin Dapp via Gcc-patches
I have an almost identical patch locally that passed testing as well,
but I didn't get around to posting it yet.  Therefore LGTM.

Regards
 Robin


Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.

2023-09-07 Thread Robin Dapp via Gcc-patches
> Thanks for giving it a go.  Can you post the latest version of the
> regpressure patch too?  The previous on-list version I could find
> seems to be too old.

Oh, sure, attached.  Apologies, I added the regpressure_same_class
convenience helper but forgot to re-send it.

Regards
 Robin

>From d3f87e4de7d7d05a2fcf8c948097b14eadf08c90 Mon Sep 17 00:00:00 2001
From: Robin Dapp 
Date: Mon, 24 Jul 2023 16:25:38 +0200
Subject: [PATCH] gcse: Extract reg pressure handling into separate file.

This patch extracts the hoist-pressure handling from gcse and puts it
into a separate file so it can be used by other passes in the future.
No functional change.

gcc/ChangeLog:

* Makefile.in: Add regpressure.o.
* gcse.cc (struct bb_data): Move to regpressure.cc.
(BB_DATA): Ditto.
(get_regno_pressure_class): Ditto.
(get_pressure_class_and_nregs): Ditto.
(record_set_data): Ditto.
(update_bb_reg_pressure): Ditto.
(should_hoist_expr_to_dom): Ditto.
(hoist_code): Ditto.
(change_pressure): Ditto.
(calculate_bb_reg_pressure): Ditto.
(one_code_hoisting_pass): Ditto.
* gcse.h (single_set_gcse): Export single_set_gcse.
* regpressure.cc: New file.
* regpressure.h: New file.
---
 gcc/Makefile.in|   1 +
 gcc/gcse.cc| 304 ++-
 gcc/gcse.h |   2 +
 gcc/regpressure.cc | 391 +
 gcc/regpressure.h  |  48 ++
 5 files changed, 459 insertions(+), 287 deletions(-)
 create mode 100644 gcc/regpressure.cc
 create mode 100644 gcc/regpressure.h

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 5930b52462a..62768a84f81 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1610,6 +1610,7 @@ OBJS = \
reg-stack.o \
regcprop.o \
reginfo.o \
+   regpressure.o \
regrename.o \
regstat.o \
reload.o \
diff --git a/gcc/gcse.cc b/gcc/gcse.cc
index f689c0c2687..5bafef7970f 100644
--- a/gcc/gcse.cc
+++ b/gcc/gcse.cc
@@ -160,6 +160,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "gcse.h"
 #include "gcse-common.h"
 #include "function-abi.h"
+#include "regpressure.h"
 
 /* We support GCSE via Partial Redundancy Elimination.  PRE optimizations
are a superset of those done by classic GCSE.
@@ -419,30 +420,6 @@ static bool doing_code_hoisting_p = false;
 /* For available exprs */
 static sbitmap *ae_kill;
 
-/* Data stored for each basic block.  */
-struct bb_data
-{
-  /* Maximal register pressure inside basic block for given register class
- (defined only for the pressure classes).  */
-  int max_reg_pressure[N_REG_CLASSES];
-  /* Recorded register pressure of basic block before trying to hoist
- an expression.  Will be used to restore the register pressure
- if the expression should not be hoisted.  */
-  int old_pressure;
-  /* Recorded register live_in info of basic block during code hoisting
- process.  BACKUP is used to record live_in info before trying to
- hoist an expression, and will be used to restore LIVE_IN if the
- expression should not be hoisted.  */
-  bitmap live_in, backup;
-};
-
-#define BB_DATA(bb) ((struct bb_data *) (bb)->aux)
-
-static basic_block curr_bb;
-
-/* Current register pressure for each pressure class.  */
-static int curr_reg_pressure[N_REG_CLASSES];
-
 
 static void compute_can_copy (void);
 static void *gmalloc (size_t) ATTRIBUTE_MALLOC;
@@ -494,8 +471,6 @@ static bool should_hoist_expr_to_dom (basic_block, struct 
gcse_expr *,
  enum reg_class,
  int *, bitmap, rtx_insn *);
 static bool hoist_code (void);
-static enum reg_class get_regno_pressure_class (int regno, int *nregs);
-static enum reg_class get_pressure_class_and_nregs (rtx_insn *insn, int 
*nregs);
 static bool one_code_hoisting_pass (void);
 static rtx_insn *process_insert_insn (struct gcse_expr *);
 static bool pre_edge_insert (struct edge_list *, struct gcse_expr **);
@@ -2402,7 +2377,7 @@ record_set_data (rtx dest, const_rtx set, void *data)
 }
 }
 
-static const_rtx
+const_rtx
 single_set_gcse (rtx_insn *insn)
 {
   struct set_data s;
@@ -2804,72 +2779,6 @@ compute_code_hoist_data (void)
 fprintf (dump_file, "\n");
 }
 
-/* Update register pressure for BB when hoisting an expression from
-   instruction FROM, if live ranges of inputs are shrunk.  Also
-   maintain live_in information if live range of register referred
-   in FROM is shrunk.
-   
-   Return 0 if register pressure doesn't change, otherwise return
-   the number by which register pressure is decreased.
-   
-   NOTE: Register pressure won't be increased in this function.  */
-
-static int
-update_bb_reg_pressure (basic_block bb, rtx_insn *from)
-{
-  rtx dreg;
-  rtx_insn *insn;
-  basic_block succ_bb;
-  df_ref use, op_ref;
-  edge succ;
-  edge_iterator ei;
-  int decreased_pressure = 0;
-  int nregs;
- 

Re: [PATCH] RISC-V: Remove unreasonable TARGET_64BIT for VLS modes with size = 64bit

2023-09-06 Thread Robin Dapp via Gcc-patches
LGTM.

Regards
 Robin



Re: [PATCH] RISC-V: Fix VSETVL PASS AVL/VL fetch bug[111295]

2023-09-06 Thread Robin Dapp via Gcc-patches
OK.

Regards
 Robin


Re: [PATCH v3] RISC-V: Add autovec FP binary operations.

2023-09-06 Thread Robin Dapp via Gcc-patches


Hi Lehua,

> May I ask if the compiler options "-march=rv64gcv_zvfh -mabi=lp64d"
> should be removed?  We don't specify -march and -mabi when we run the
> testcase (so why do we need to specify them in this target check?);
> we run it with the default values.  Assuming the default is
> rv64gcv_zvfh_zfh, the `riscv_vector` check will fail because
> compiling and linking with -march=rv64gcv will throw the following
> error unless I build a multilib compiler.  But since rv64gcv_zvfh_zfh
> contains rv64gcv, this case should not report a link error:

Yes, you're right, this should not be necessary.  This is more of a
test for the execution environment than the compiler.  I think Juzhe
removed it already in his patch that enables the vector test suite.
In the future we would use the checks that Joern added, not sure if
they are already upstream.

Regards
 Robin



Re: [PATCH V2] RISC-V: Support Dynamic LMUL Cost model

2023-09-06 Thread Robin Dapp via Gcc-patches
Hi Juzhe,

general remark upfront:  Please add function-level comments for all
functions.  This makes reading and reviewing much easier.  I had to sweep
back and forth quite a bit.

> +
> +static int
> +get_last_live_range (const vec<var_live_range> &live_ranges, tree var)
> +{
> +  unsigned int ix;
> +  var_live_range *live_range;
> +  FOR_EACH_VEC_ELT_REVERSE (live_ranges, ix, live_range)
> +if (live_range->var == var)
> +  return ix;
> +  return -1;
> +}

From reading the usage site of this function it looks like we could benefit
from having the live ranges be a hash_map as well?  That way we wouldn't
need to scan through the list every time.  Something like
hash_map<tree, var_live_range>.  It looks like we only consider the range
end anyway.

> +   int index = get_last_live_range (live_ranges, var);

That way we could avoid some worst-case behavior here for pathological
inputs.

> +   if (index == -1)
> + {
> +   var_live_range range = {var, 0, point};
> +   live_ranges.safe_push (range);
> + }

Please add a comment that we assume the variable is live from the start
of this block. 

> +   else
> + live_ranges[index].end = point;

And here a comment that we will grow the live range for each use.

> +static bool
> +live_range_conflict_p (const var_live_range &live_range1,
> +                       const var_live_range &live_range2)
> +{
> +  if (live_range1.start >= live_range2.end)
> +return false;
> +  if (live_range1.end <= live_range2.start)
> +return false;
> +  if (live_range2.start >= live_range1.end)
> +return false;
> +  if (live_range2.end <= live_range1.start)
> +return false;
> +  return true;
> +}

Rename to live_range_overlap_p and simplify to
 return a.start < b.end && b.start < a.end;

> +
> +static unsigned int
> +max_number_of_live_regs (const basic_block bb,
> +  const vec<var_live_range> &live_ranges,
> +  machine_mode biggest_mode, int lmul)
> +{
> +  unsigned int max_nregs = 0;
> +  unsigned int i, j, k;
> +  unsigned int live_point = 0;
> +  for (i = 0; i < live_ranges.length (); i++)
> +{
> +  auto_vec<var_live_range> conflict_live_ranges;
> +  var_live_range live_range = live_ranges[i];
> +  conflict_live_ranges.safe_push (live_range);
> +  unsigned int min_point = live_range.start;
> +  unsigned int max_point = live_range.end;
> +  for (j = 0; j < live_ranges.length (); j++)
> + {
> +   if (j == i)
> + continue;
> +   if (live_range_conflict_p (live_range, live_ranges[j]))
> + {
> +   conflict_live_ranges.safe_push (live_ranges[j]);
> +   min_point
> + = std::min (min_point, (unsigned int) live_ranges[j].start);
> +   max_point
> + = std::max (max_point, (unsigned int) live_ranges[j].end);
> + }
> + }
> +  for (j = min_point; j <= max_point; j++)
> + {
> +   unsigned int nregs = 0;
> +   for (k = 0; k < conflict_live_ranges.length (); k++)
> + {
> +   if (j >= (unsigned int) conflict_live_ranges[k].start
> +   && j <= (unsigned int) conflict_live_ranges[k].end)
> + {
> +   machine_mode mode
> + = TYPE_MODE (TREE_TYPE (conflict_live_ranges[k].var));
> +   nregs += compute_nregs_for_mode (mode, biggest_mode, lmul);
> + }
> + }
> +   if (nregs > max_nregs)
> + {
> +   max_nregs = nregs;
> +   live_point = j;
> + }
> + }
> +}

This looks pretty quadratic in the number of live ranges (or even cubic?).
Can't it be done more efficiently using a sliding-window approach by sorting
the live ranges according to their start point before?
Also std::min/max -> MIN/MAX.

> +
> +  /* Collect user explicit RVV type.  */
> +  hash_set<basic_block> all_preds = get_all_predecessors (bb);
> +  hash_set<basic_block> all_succs = get_all_successors (bb);

As mentioned before, maybe dominator info could help here?

> +  for (i = 0; i < cfun->gimple_df->ssa_names->length (); i++)
> +{
> +  tree t = ssa_name (i);
> +  if (!t)
> + continue;
> +  machine_mode mode = TYPE_MODE (TREE_TYPE (t));
> +  if (!lookup_vector_type_attribute (TREE_TYPE (t))
> +   && !riscv_v_ext_vls_mode_p (mode))
> + continue;
> +
> +  gimple *def = SSA_NAME_DEF_STMT (t);
> +  if (gimple_bb (def) && !all_preds.contains (gimple_bb (def)))
> + continue;
> +  const ssa_use_operand_t *const head = &(SSA_NAME_IMM_USE_NODE (t));
> +  const ssa_use_operand_t *ptr;
> +
> +  for (ptr = head->next; ptr != head; ptr = ptr->next)
> + {
> +   if (USE_STMT (ptr) && !is_gimple_debug (USE_STMT (ptr)))
> + {
> +   if (all_succs.contains (gimple_bb (USE_STMT (ptr))))
> + {

Reverse the conditions and continue, i.e. if (!USE_STMT || is_gimple_debug
 || !all_succs.contains).

> +
> +static int
> 

Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.

2023-09-06 Thread Robin Dapp via Gcc-patches
Hi Richard,

I did some testing with the attached v2 that does not restrict to UNARY
anymore.  As feared ;) there is some more fallout that I'm detailing below.

On Power there is one guality fail (pr43051-1.c) that I would take
the liberty of ignoring for now.

On x86 there are four fails:

 - cond_op_addsubmuldiv__Float16-2.c: assembler error
   unsupported masking for `vmovsh'.  I guess that's a latent backend
   problem.

 - ifcvt-3.c, pr49095.c: Here we propagate into a compare.  Before, we had
   (cmp (reg/CC) 0) and now we have (cmp (plus (reg1 reg2) 0).
   That looks like a costing problem and can hopefully solveable by making
   the second compare more expensive, preventing the propagation.
   i386 costing (or every costing?) is brittle so that could well break other
   things. 

 - pr88873.c: This is interesting because even before this patch we
   propagated with different register classes (V2DF vs DI).  With the patch
   we check the register pressure, find the class NO_REGS for V2DF and
   abort (because the patch assumes NO_REGS = high pressure).  I'm thinking
   of keeping the old behavior for reg-reg propagations and only checking
   the pressure for more complex operations.

aarch64 has the most fails:

 - One guality fail (same as Power).
 - shrn-combine-[123].c as before.

 - A class of (hopefully, I only checked some) similar cases where we
   propagate an unspec_whilelo into an unspec_ptest.  Before we would only
   set a REG_EQUALS note.
   Before we managed to create a while_ultsivnx16bi_cc whereas now we have
   while_ultsivnx16bi and while_ultsivnx16bi_ptest that won't be combined.
   We create redundant whilelos and I'm not sure how to improve that. I
   guess a peephole is out of the question :)

 - pred-combine-and.c: Here the new propagation appears useful at first.
   We propagate a "vector mask and" into a while_ultsivnx4bi_ptest and the
   individual and registers remain live up to the propagation site (while
   being dead before the patch).
   With the registers dead, combine could create a single fcmgt before.
   Now it only manages a 2->2 combination because we still need the registers
   and end up with two fcmgts.
   The code is worse but this seems more bad luck than anything.

 - Addressing fails from before:  I looked into these and suspect all of
   them are a similar.
   What happens is that we have a poly_int offset that we shift, negate
   and then add to x0.  The result is used as load address.
   Before, we would pull (combine) the (plus x0 reg) into the load keeping
   the neg and shift.
   Now we propagate everything into a single (set (minus x0 offset)).
   The propagation itself seems worthwhile because we save one insn.
   However as we got rid of the base/offset split by lumping everything
   together, combine cannot pull the (plus) into the address load and
   we require an aarch64_split_add_offset.  This will emit the longer
   sequence of ashiftl and subtract.  The "base" address is x0 here so
   we cannot convert (minus x0 ...)) into neg.
   I didn't go through all of aarch64_split_add_offset.  I suppose we
   could re-add the separation of base/offset there but that might be
   a loss when the result is not used as an address. 
   
Again, all in all no fatal problems but pretty annoying :)  It's not much
but just gradually worse than with just UNARY.  Any idea on how/whether to
continue?

Regards
 Robin

gcc/ChangeLog:

* fwprop.cc (fwprop_propagation::profitable_p): Add unary
handling.
(fwprop_propagation::update_register_pressure): New function.
(fwprop_propagation::register_pressure_high_p): New function
(reg_single_def_for_src_p): Look through unary expressions.
(try_fwprop_subst_pattern): Check register pressure.
(forward_propagate_into): Call new function.
(fwprop_init): Init register pressure.
(fwprop_done): Clean up register pressure.
(fwprop_insn): Add comment.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c: New test.
---
 gcc/fwprop.cc | 359 +-
 .../riscv/rvv/autovec/binop/vadd-vx-fwprop.c  |  64 
 2 files changed, 419 insertions(+), 4 deletions(-)
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c

diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc
index 0707a234726..ce6f5a74b00 100644
--- a/gcc/fwprop.cc
+++ b/gcc/fwprop.cc
@@ -36,6 +36,10 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-pass.h"
 #include "rtl-iter.h"
 #include "target.h"
+#include "dominance.h"
+
+#include "ira.h"
+#include "regpressure.h"
 
 /* This pass does simple forward propagation and simplification when an
operand of an insn can only come from a single def.  This pass uses
@@ -103,6 +107,10 @@ using namespace rtl_ssa;
 
 static int num_changes;
 
+/* Keep track of which registers already increased the pressure to avoid double
+   booking.  

Re: [PATCH V2] RISC-V: Support Dynamic LMUL Cost model

2023-09-05 Thread Robin Dapp via Gcc-patches
Hi Juzhe,

I think the general approach makes sense and it doesn't need to be perfect
from the beginning as we can always iterate on it.  Before continuing with a
more detailed review (hopefully tomorrow) some high-level questions upfront.
It would help to document some of these choices so it's easier to understand
the rationale.

 - Why don't we use the normal reverse postorder (or postorder) approach of
   computing live ranges?  Is that because we don't really need full global
   live ranges?

 - Why can't we use existing code i.e. tree-ssa-live?  I suspect I already
   know the answer but an explanation (in a comment) would still be useful.

 - Do we really need get_all_predecessors/get_all_successors?  As they're
   only used for "defined before" and "used after", at first glance it
   looks like some kind of dominance info could help there but I didn't
   really check in detail.

 - Why don't we use bitmaps/sbitmaps like in vsetvl.cc and other related
   passes?  I don't mind maps but just wonder if it's on purpose, for
   convenience or something else. 

Besides, it might help to rename program_points_map (into program_points_per_bb
or so).  At first it looked quadratic to me but we're just iterating over
the program points of a BB.

Regards
 Robin



Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.

2023-09-05 Thread Robin Dapp via Gcc-patches
> I imagine doing it in reverse postorder would still make sense.
> 
> But my point was that, for the current fwprop limitation of substituting
> into exactly one use of a register, we can check whether that use is
> the *only* use of register.
> 
> I.e. if we substitute:
> 
>   A: (set (reg R1) (foo (reg R2)))
> 
> into:
> 
>   B: (set ... (reg R1) ...)
> 
> if R1 and R2 are likely to be in the same register class, and if B
> is the only user of R2, then we don't need to calculate register
> pressure.  The change is either neutral (if R2 died in A) or an
> improvement (if R2 doesn't die in A, and so R1 and R2 were previously
> live at the same time).

Ah, understood, thanks.  Sure, that one I can include.

Regards
 Robin


Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.

2023-09-05 Thread Robin Dapp via Gcc-patches
> So I don't think I have a good feel for the advantages and disadvantages
> of doing this.  Robin's analysis of the aarch64 changes was nice and
> detailed though.  I think the one that worries me most is the addressing
> mode one.  fwprop is probably the first chance we get to propagate adds
> into addresses, and virtual register elimination means that some of
> those opportunities won't show up in gimple.
> 
> There again, virtual register elimination wouldn't be the reason for
> the ld4_s8.c failure.  Perhaps there's something missing in expand.
> 
> Other than that, I think my main question is: why just unary operations?
> Is the underlying assumption that we only want to propagate a maximum of
> one register?  If so, then I think we should check for that directly, by
> iterating over subrtxes.

The main reason for stopping at unary operations was to limit the scope
and change as little as possible (not restricting the change to one
register).  I'm currently testing a v2 that iterates over subrtxs.

> Perhaps we should allow the optimisation without register-pressure
> information if (a) the source register and destination register are
> in the same pressure class and (b) all uses of the destination are
> being replaced.  (FWIW, rtl-ssa should make it easier to try to
> replace all definitions at once, with an all-or-nothing choice,
> if we ever wanted to do that.)

I presume you're referring to replacing one register (dest) in all using
insns?  Source and destination are somewhat overloaded in fwprop context
because I'm thinking of the "to be replaced" register as dest when it's
actually the replacement register.

AFAICT fwprop currently iterates over insns, going through all their uses
and checking whether an individual use can be substituted.  Do you suggest
changing this general iteration order to iterate over the defs of an insn
and then trying to replace all the uses at once (e.g. using
ssa->change_insns)?  When keeping the current order, wouldn't we need to
store all potential changes instead of committing them and later apply
them in bulk, e.g. grouped by use?  This order would also help to pick the
propagation with the largest number of uses (i.e. propagation potential)
but maybe I'm misunderstanding?

Regards
 Robin



Re: [PATCH] expmed: Allow extract_bit_field via mem for low-precision modes.

2023-09-01 Thread Robin Dapp via Gcc-patches
> It's not just a question of which byte though.  It's also a question
> of which bit.
> 
> One option would be to code-generate for even X and for odd X, and select
> between them at runtime.  But that doesn't scale well to 2+2X and 1+1X.
> 
> Otherwise I think we need to treat the bit position as a variable,
> with bitpos % 8 and bitpos / 8 being calculated at runtime.

Thanks.  I worked around it with a backend vec_extractQI expander
so we don't run into that situation directly anymore.  The problem is of
course still latent and I'm going to look at it again after some other things
on my plate.

Regards
 Robin


Re: [PATCH 4/4] RISC-V: Add conditional autovec convert(INT<->FP) patterns

2023-09-01 Thread Robin Dapp via Gcc-patches
This one is OK as well, thanks.

Regards
 Robin


Re: [PATCH 3/4] RISC-V: Add conditional autovec convert(FP<->FP) patterns

2023-09-01 Thread Robin Dapp via Gcc-patches
Hi Lehua,

this is OK, thanks.

Regards
 Robin



Re: [PATCH 2/4] RISC-V: Add conditional autovec convert(INT<->INT) patterns

2023-09-01 Thread Robin Dapp via Gcc-patches
Hi Lehua,

this LGTM now, thanks.  It's also easier to read after the refactor :)

Regards
 Robin



Re: [PATCH 1/4] RISC-V: Adjust expand_cond_len_{unary,binop,op} api

2023-09-01 Thread Robin Dapp via Gcc-patches
Thanks, LGTM.

Btw. I haven't forgotten to respond to your last refactor but just didn't find
the time yet.  I figured I should have some proper draft before suggesting
more things :)

Regards
 Robin



Re: [PATCH] RISC-V: Add dynamic LMUL compile option

2023-09-01 Thread Robin Dapp via Gcc-patches
LGTM

Regards
 Robin



Re: [PATCH] RISC-V: Enable VECT_COMPARE_COSTS by default

2023-09-01 Thread Robin Dapp via Gcc-patches
Hi Juzhe,

thanks, this is OK, we would have needed this sooner or later anyway.

Regards
 Robin



[PATCH] RISC-V: Add vec_extract for BI -> QI.

2023-09-01 Thread Robin Dapp via Gcc-patches
Hi,

this patch adds a vec_extract expander that extracts a QImode from a
vector mask mode.  In doing so, it helps recognize a "live operation"/extract
last idiom for mask modes.
It fixes the ICE in tree-vect-live-6.c by circumventing the fallback
code in extract_bit_field_1.  The problem there is still latent, though,
and needs to be addressed separately.

Regards
 Robin

gcc/ChangeLog:

* config/riscv/autovec.md (vec_extract<mode>qi): New expander.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/partial/live-2.c: New test.
* gcc.target/riscv/rvv/autovec/partial/live_run-2.c: New test.
---
 gcc/config/riscv/autovec.md   | 36 +
 .../riscv/rvv/autovec/partial/live-2.c| 31 +++
 .../riscv/rvv/autovec/partial/live_run-2.c| 54 +++
 3 files changed, 121 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/live-2.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/live_run-2.c

diff --git a/gcc/config/riscv/autovec.md b/gcc/config/riscv/autovec.md
index ebe1b10aa12..2e3e8e720a5 100644
--- a/gcc/config/riscv/autovec.md
+++ b/gcc/config/riscv/autovec.md
@@ -1409,6 +1409,42 @@ (define_expand "vec_extract"
   DONE;
 })
 
+;; -
+;; This extracts a bit (via QImode) from a bitmask vector.
+;; -
+(define_expand "vec_extract<mode>qi"
+  [(set (match_operand:QI 0 "register_operand")
+ (vec_select:QI
+   (match_operand:VB 1 "register_operand")
+   (parallel
+[(match_operand  2 "nonmemory_operand")])))]
+  "TARGET_VECTOR"
+{
+  /* Create an empty byte vector and set it to one under mask.  */
+  machine_mode qimode = riscv_vector::get_vector_mode
+    (QImode, GET_MODE_NUNITS (<MODE>mode)).require ();
+
+  rtx tmp1 = gen_reg_rtx (qimode);
+  emit_move_insn (tmp1, gen_const_vec_duplicate (qimode, GEN_INT (0)));
+  rtx ones = gen_const_vec_duplicate (qimode, GEN_INT (1));
+
+  rtx ops1[] = {tmp1, tmp1, ones, operands[1]};
+  riscv_vector::emit_vlmax_insn (code_for_pred_merge (qimode),
+riscv_vector::MERGE_OP, ops1);
+
+  /* Slide down the requested byte element.  */
+  rtx tmp2 = gen_reg_rtx (qimode);
+
+  rtx ops2[] = {tmp2, tmp1, operands[2]};
+  riscv_vector::emit_vlmax_insn
+(code_for_pred_slide (UNSPEC_VSLIDEDOWN, qimode),
+ riscv_vector::BINARY_OP, ops2);
+
+  /* Extract it.  */
+  emit_insn (gen_pred_extract_first (qimode, operands[0], tmp2));
+  DONE;
+})
+
 ;; -
 ;;  [FP] Binary operations
 ;; -
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/live-2.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/live-2.c
new file mode 100644
index 000..69c2a44219a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/live-2.c
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-march=rv32gcv_zvfh -mabi=ilp32d 
-fno-vect-cost-model --param riscv-autovec-preference=scalable 
-fdump-tree-optimized-details" } */
+
+#include 
+
+#define EXTRACT_LAST(TYPE) 
\
+  _Bool __attribute__ ((noipa))
\
+  test_##TYPE (TYPE *restrict x, TYPE *restrict y, int n) \
+  {
\
+_Bool last;
\
+for (int j = 0; j < n; ++j)
\
+  {
\
+   last = !x[j];  \
+   y[j] = last;   \
+  }
\
+return last;   
\
+  }
+
+#define TEST_ALL(T)
\
+  T (int8_t)   
\
+  T (int16_t)  
\
+  T (int32_t)  
\
+  T (int64_t)  
\
+  T (uint8_t)  
\
+  T (uint16_t) 
\
+  T (uint32_t) 
\
+  T (uint64_t)
+
+TEST_ALL (EXTRACT_LAST)
+
+/* { dg-final { scan-tree-dump-times 

Re: [PATCH] RISC-V: Add Vector cost model framework for RVV

2023-08-31 Thread Robin Dapp via Gcc-patches
OK.  As it doesn't do anything and we'll be needing it anyway no harm
in adding it.

Regards
 Robin


[PATCH] testsuite/vect: Make match patterns more accurate.

2023-08-31 Thread Robin Dapp via Gcc-patches
Hi,

on some targets we fail to vectorize with the first type the vectorizer
tries but succeed with the second.  This patch changes several regex
patterns to reflect that behavior.

Before we would look for a single occurrence of e.g.
"vect_recog_dot_prod_pattern" but would possible find two (one for each
attempted mode).  The new pattern tries to match sequences where we
first have a "vect_recog_dot_prod_pattern" and a "succeeded" afterwards
while making sure there is no "failed" or "Re-trying" in between.

I realized we already only do scan-tree-dump instead of
scan-tree-dump-times in some related testcases, probably for the same
reason but I didn't touch them for now.

Testsuite unchanged on x86, aarch64 and Power10.

Regards
 Robin

gcc/testsuite/ChangeLog:

* gcc.dg/vect/vect-reduc-dot-s16a.c: Adjust regex pattern.
* gcc.dg/vect/vect-reduc-dot-s8a.c: Ditto.
* gcc.dg/vect/vect-reduc-dot-s8b.c: Ditto.
* gcc.dg/vect/vect-reduc-dot-u16a.c: Ditto.
* gcc.dg/vect/vect-reduc-dot-u16b.c: Ditto.
* gcc.dg/vect/vect-reduc-dot-u8a.c: Ditto.
* gcc.dg/vect/vect-reduc-dot-u8b.c: Ditto.
* gcc.dg/vect/vect-reduc-pattern-1a.c: Ditto.
* gcc.dg/vect/vect-reduc-pattern-1b-big-array.c: Ditto.
* gcc.dg/vect/vect-reduc-pattern-1c-big-array.c: Ditto.
* gcc.dg/vect/vect-reduc-pattern-2a.c: Ditto.
* gcc.dg/vect/vect-reduc-pattern-2b-big-array.c: Ditto.
* gcc.dg/vect/wrapv-vect-reduc-dot-s8b.c: Ditto.
---
 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s16a.c | 2 +-
 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s8a.c  | 4 ++--
 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s8b.c  | 4 ++--
 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-u16a.c | 5 +++--
 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-u16b.c | 2 +-
 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-u8a.c  | 2 +-
 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-u8b.c  | 2 +-
 gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-1a.c   | 2 +-
 gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-1b-big-array.c | 2 +-
 gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-1c-big-array.c | 2 +-
 gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-2a.c   | 2 +-
 gcc/testsuite/gcc.dg/vect/vect-reduc-pattern-2b-big-array.c | 2 +-
 gcc/testsuite/gcc.dg/vect/wrapv-vect-reduc-dot-s8b.c| 4 ++--
 13 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s16a.c 
b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s16a.c
index ffbc9706901..d826828e3d6 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s16a.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s16a.c
@@ -51,7 +51,7 @@ main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "vect_recog_dot_prod_pattern: detected" 1 
"vect" } } */
+/* { dg-final { scan-tree-dump-times "vect_recog_dot_prod_pattern: 
detected(?:(?!failed)(?!Re-trying).)*succeeded" 1 "vect" } } */
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target 
vect_sdot_hi } } } */
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target 
vect_widen_mult_hi_to_si } } } */
 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s8a.c 
b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s8a.c
index 05e343ad782..4e1e0b234f4 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s8a.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s8a.c
@@ -55,8 +55,8 @@ int main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "vect_recog_dot_prod_pattern: detected" 1 
"vect" } } */
-/* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern: detected" 
1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vect_recog_dot_prod_pattern: 
detected(?:(?!failed)(?!Re-trying).)*succeeded" 1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern: 
detected(?:(?!failed)(?!Re-trying).)*succeeded" 1 "vect" } } */
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target 
vect_sdot_qi } } } */
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { 
vect_widen_mult_qi_to_hi && vect_widen_sum_hi_to_si } } } } */
 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s8b.c 
b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s8b.c
index 82c648cc73c..cb88ad5b639 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s8b.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-s8b.c
@@ -53,8 +53,8 @@ int main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "vect_recog_dot_prod_pattern: detected" 1 
"vect" { xfail *-*-* } } } */
-/* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern: detected" 
1 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vect_recog_dot_prod_pattern: 
detected(?:(?!failed)(?!Re-trying).)*succeeded" 1 "vect" { xfail *-*-* } } } */
+/* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern: 
detected(?:(?!failed)(?!Re-trying).)*succeeded" 1 

Re: [PATCH] RISC-V: Refactor and clean emit_{vlmax,nonvlmax}_xxx functions

2023-08-31 Thread Robin Dapp via Gcc-patches
Hi Lehua,

thanks, this definitely goes into the direction of what I had in mind and
simplifies a lot of the reduntant emit_... so it's good to have it.

I was too slow for a detailed response :)  So just some high-level comments.

One thing I noticed is the overloading of "MASK_OP": we use it both as
"operation on masks", i.e. an insn, and as "mask policy".  IMHO we could
get rid of UNARY_MASK_OP and BINARY_MASK_OP and just decide whether to
add a mask policy depending on if all operands are masks (the same way we
did before).

Related, and seeing that the MASK in UNARY_MASK_OP is somewhat redundant,
I feel we still mix concerns a bit.  For example it is not obvious, from
the name at least, why a WIDEN_TERNARY_OP does not have a merge operand
and the decision making seems very "enum centered" now :D

In general we use the NULLARY, UNARY, BINARY, TERNARY prefixes just
to determine the number of sources which doesn't seem really necessary
because a user of e.g. NEG will already know that there is only one
source - they already specified it and currently need to, redundantly,
say UNARY again.
 
If we split off the destination and sources from mask, merge and the rest
we could ditch them altogether.

What about
 emit_(non)vlmax_insn (icode, *operands (just dest and sources),
   mask, merge, tail/mask policy, frm)

with mask defaulting to NULL and merge defaulting to VUNDEF?  So ideally,
and in the easy case, the call would just degenerate to
 emit_..._insn (icode, operands).
 
I realize this will cause some complications on the "other side" but with
the enum in place it should still be doable?

No need to address this right away though, just sharing some ideas again.

Regards
 Robin



Re: [PATCH] expmed: Allow extract_bit_field via mem for low-precision modes.

2023-08-30 Thread Robin Dapp via Gcc-patches
> But in the VLA case, doesn't it instead have precision 4+4X?
> The problem then is that we can't tell at compile time which
> byte that corresponds to.  So...

Yes 4 + 4x.  I keep getting confused with poly modes :)
In this case we want to extract the bitnum [3 4] = 3 + 4x which
would be in byte 0 for x = 0 or x = 1 and in byte 1 for x = 2, 3 and
so on.

Can't we still make that work somehow?  As far as I can tell we're looking
for the byte range to be accessed.  It's not like we have a precision or
bitnum of e.g. [3 17] where the access could be anywhere but still a pow2
fraction of BITS_PER_UNIT.

I'm just having trouble writing that down.

What about something like

int factor = BITS_PER_UNIT / prec.coeffs[0];
bytenum = force_align_down_and_div (bitnum, prec.coeffs[0]);
bytenum *= factor;

(or a similar thing done manually without helpers) guarded by the
proper condition?
Or do we need something more generic for the factor (i.e. is prec.coeffs[0]
not enough when we have a precision like [8 16])?  Does that even exist?

Regards
 Robin


[PATCH] expmed: Allow extract_bit_field via mem for low-precision modes.

2023-08-30 Thread Robin Dapp via Gcc-patches
Hi,

when looking at a riscv ICE in vect-live-6.c I noticed that we
assume that the variable part (coeffs[1] * x1) of the to-be-extracted
bit number in extract_bit_field_1 is a multiple of BITS_PER_UNIT.

This means that bits_to_bytes_round_down and num_trailing_bits
cannot handle e.g. extracting from a "VNx4BI"-mode vector which has
4-bit precision on riscv.

This patch adds a special case for that situation and sets bytenum to
zero as well as bitnum to its proper value.  It works for the riscv
case because in all other situations we can align to a byte boundary.
If x1 were 3 for some reason, however, the above assertion would still
fail.  I don't think this can happen for riscv as we only ever double
the number of chunks for larger vector sizes but not sure about the
general case.

If there's another, correct way to work around feel free to suggest.

Bootstrap/testsuite on aarch64 and x86 is running but I would be
surprised if there were any changes as riscv is the only target that
uses modes with precision < 8.

Regards
 Robin

gcc/ChangeLog:

* expmed.cc (extract_bit_field_1): Handle bitnum with variable
part less than BITS_PER_UNIT.
---
 gcc/expmed.cc | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/gcc/expmed.cc b/gcc/expmed.cc
index e22e43c8505..1b0119f9cfc 100644
--- a/gcc/expmed.cc
+++ b/gcc/expmed.cc
@@ -1858,8 +1858,22 @@ extract_bit_field_1 (rtx str_rtx, poly_uint64 bitsize, 
poly_uint64 bitnum,
  but is useful for things like vector booleans.  */
   if (MEM_P (op0) && !bitnum.is_constant ())
 {
-  bytenum = bits_to_bytes_round_down (bitnum);
-  bitnum = num_trailing_bits (bitnum);
+  /* bits_to_bytes_round_down tries to align to a byte (BITS_PER_UNIT)
+boundary and asserts that bitnum.coeffs[1] % BITS_PER_UNIT == 0.
+For modes with precision < BITS_PER_UNIT this fails but we can
+still extract from the first byte.  */
+  poly_uint16 prec = GET_MODE_PRECISION (outermode);
+  if (prec.coeffs[1] < BITS_PER_UNIT && bitnum.coeffs[1] < BITS_PER_UNIT)
+   {
+ bytenum = 0;
+ bitnum = bitnum.coeffs[0] & (BITS_PER_UNIT - 1);
+   }
+  else
+   {
+ bytenum = bits_to_bytes_round_down (bitnum);
+ bitnum = num_trailing_bits (bitnum);
+   }
+
   poly_uint64 bytesize = bits_to_bytes_round_up (bitnum + bitsize);
   op0 = adjust_bitfield_address_size (op0, BLKmode, bytenum, bytesize);
   op0_mode = opt_scalar_int_mode ();
-- 
2.41.0



Re: [PATCH V3] RISC-V: Refactor and clean expand_cond_len_{unop,binop,ternop}

2023-08-29 Thread Robin Dapp via Gcc-patches
Hi Lehua,

thanks, LGTM now.

Regards
 Robin



Re: [PATCH V4] RISC-V: Enable vec_int testsuite for RVV VLA vectorization

2023-08-28 Thread Robin Dapp via Gcc-patches
> LGTM from my side, but I would like to wait Robin is ok too

In principle I'm OK with it as well, realizing we will still need to fine-tune
a lot here anyway.  For now, IMHO it's good to have some additional test 
coverage
in the vector space but we should not expect every test to be correct/a good 
match
for everything we do yet.  Juzhe mentioned he doesn't want to commit this before
all/most bugs are addressed anyway, right?

Regards
 Robin


Re: [PATCH] RISC-V: Refactor and clean expand_cond_len_{unop,binop,ternop}

2023-08-28 Thread Robin Dapp via Gcc-patches
Hi Lehua,

thanks for starting with the refactoring.  I have some minor comments.

> +/* The value means the number of operands for insn_expander.  */
>  enum insn_type
>  {
>RVV_MISC_OP = 1,
>RVV_UNOP = 2,
> -  RVV_UNOP_M = RVV_UNOP + 2,
> -  RVV_UNOP_MU = RVV_UNOP + 2,
> -  RVV_UNOP_TU = RVV_UNOP + 2,
> -  RVV_UNOP_TUMU = RVV_UNOP + 2,
> +  RVV_UNOP_MASK = RVV_UNOP + 2,

Cleaning up here is good; right now it's not really an insn_type but
indeed just the number of operands.  My original idea was to have an
insn type and a mostly unified expander that performs all necessary
operations depending on the insn_type.  Just to give an idea of why it's
called that way.
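
As a sketch of that original idea - an insn type from which the operand count
is deduced instead of spelled out (the struct and field names are invented for
the illustration; the asserted values line up with the RVV_UNOP (2),
RVV_UNOP_MASK (4) and RVV_BINOP_MASK (5) constants quoted above):

```cpp
#include <cassert>

// Illustration only: properties of an insn type from which the number of
// expander operands can be derived.
struct insn_type_sketch
{
  bool has_mask;
  bool has_merge;
  int n_sources;
};

int
num_operands (insn_type_sketch t)
{
  return 1 /* dest */ + (t.has_mask ? 1 : 0) + (t.has_merge ? 1 : 0)
	 + t.n_sources;
}
```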

> +  rtx ops[RVV_BINOP_MASK] = {target, mask, target, op, sel};
> +  emit_vlmax_masked_mu_insn (icode, RVV_BINOP_MASK, ops);

One of the ideas was that a function emit_vlmax_masked_mu_insn would already
know that it's dealing with a mask and we would just pass something like
RVV_BINOP.  The other way would be to just have emit_vlmax_mu_insn or
something and let the rest be deduced from the insn_type.  Even the vlmax
I intended to have mostly implicit but that somehow got lost during
refactorings :)  No need to change anything for now, just for perspective
again. 

> -/* Expand unary ops COND_LEN_*.  */
> -void
> -expand_cond_len_unop (rtx_code code, rtx *ops)
> +/* Subroutine to expand COND_LEN_* patterns.  */
> +static void
> +expand_cond_len_op (rtx_code code, unsigned icode, int op_num, rtx *cond_ops,
> + rtx len)
>  {

Would you mind renaming op_num (i.e. usually understood as operand number)
into num_ops or nops (i.e. number of operands)?  That way we would be more
in line with what the later expander functions do.

> -  rtx dest = ops[0];
> -  rtx mask = ops[1];
> -  rtx src = ops[2];
> -  rtx merge = ops[3];
> -  rtx len = ops[4];
> +  rtx dest = cond_ops[0];
> +  rtx mask = cond_ops[1];

I would actually prefer to keep "ops" because it's already clear from the
function name that we work with a conditional function (and we don't have
any other ops).

>  
> +/* Expand unary ops COND_LEN_*.  */
> +void
> +expand_cond_len_unop (rtx_code code, rtx *ops)
> +{
> +  rtx dest = ops[0];
> +  rtx mask = ops[1];
> +  rtx src = ops[2];
> +  rtx merge = ops[3];
> +  rtx len = ops[4];
> +
> +  machine_mode mode = GET_MODE (dest);
> +  insn_code icode = code_for_pred (code, mode);
> +  rtx cond_ops[RVV_UNOP_MASK] = {dest, mask, merge, src};
> +  expand_cond_len_op (code, icode, RVV_UNOP_MASK, cond_ops, len);
> +}

We're already a bit inconsistent with how we pass mask, merge and the source
operands.  Maybe we could also unify this a bit?  I don't have a clear
preference for either, though.

> +  rtx cond_ops[RVV_BINOP_MASK] = {dest, mask, merge, src1, src2};

Here, the merge comes before the sources as well.

> +  rtx cond_ops[RVV_TERNOP_MASK] = {dest, mask, src1, src2, src3, merge};
And here, the merge comes last.  I realize this makes sense in the context
of a ternary operation because the merge is always "real".  As our vector
patterns are similar, maybe we should use this ordering all the time?

Regards
 Robin



Re: [PATCH V3] RISC-V: Enable vec_int testsuite for RVV VLA vectorization

2023-08-28 Thread Robin Dapp via Gcc-patches
On 8/28/23 12:16, Juzhe-Zhong wrote:
> FAIL: gcc.dg/vect/bb-slp-10.c -flto -ffat-lto-objects  scan-tree-dump slp2 
> "unsupported unaligned access"
> FAIL: gcc.dg/vect/bb-slp-10.c scan-tree-dump slp2 "unsupported unaligned 
> access"
> XPASS: gcc.dg/vect/no-scevccp-outer-12.c scan-tree-dump-times vect "OUTER 
> LOOP VECTORIZED." 1
> XPASS: gcc.dg/vect/no-scevccp-outer-16.c scan-tree-dump-times vect "OUTER 
> LOOP VECTORIZED." 1
> XPASS: gcc.dg/vect/no-scevccp-outer-17.c scan-tree-dump-times vect "OUTER 
> LOOP VECTORIZED." 1
> XPASS: gcc.dg/vect/no-scevccp-outer-19.c scan-tree-dump-times vect "OUTER 
> LOOP VECTORIZED." 1
> XPASS: gcc.dg/vect/no-scevccp-outer-21.c scan-tree-dump-times vect "OUTER 
> LOOP VECTORIZED." 1
> FAIL: gcc.dg/vect/no-scevccp-outer-7.c scan-tree-dump-times vect 
> "vect_recog_widen_mult_pattern: detected" 1
> XPASS: gcc.dg/vect/no-scevccp-outer-8.c scan-tree-dump-times vect "OUTER LOOP 
> VECTORIZED." 1
> FAIL: gcc.dg/vect/no-section-anchors-vect-31.c scan-tree-dump-times vect 
> "Alignment of access forced using peeling" 2
> FAIL: gcc.dg/vect/no-section-anchors-vect-64.c scan-tree-dump-times vect 
> "Alignment of access forced using peeling" 2
> FAIL: gcc.dg/vect/no-section-anchors-vect-69.c scan-tree-dump-times vect 
> "vectorized 3 loops" 1
> FAIL: gcc.dg/vect/no-vfa-vect-101.c scan-tree-dump-times vect "can't 
> determine dependence" 1
> FAIL: gcc.dg/vect/no-vfa-vect-102.c scan-tree-dump-times vect "possible 
> dependence between data-refs" 1
> FAIL: gcc.dg/vect/no-vfa-vect-102a.c scan-tree-dump-times vect "possible 
> dependence between data-refs" 1
> FAIL: gcc.dg/vect/no-vfa-vect-37.c scan-tree-dump-times vect "can't determine 
> dependence" 2
> FAIL: gcc.dg/vect/pr57705.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "vectorized 1 loop" 2
> FAIL: gcc.dg/vect/pr57705.c scan-tree-dump-times vect "vectorized 1 loop" 2
> FAIL: gcc.dg/vect/pr63341-1.c -flto -ffat-lto-objects execution test
> FAIL: gcc.dg/vect/pr63341-1.c execution test
> FAIL: gcc.dg/vect/pr63341-2.c -flto -ffat-lto-objects execution test
> FAIL: gcc.dg/vect/pr63341-2.c execution test
> FAIL: gcc.dg/vect/pr65310.c -flto -ffat-lto-objects  scan-tree-dump vect 
> "can't force alignment"
> FAIL: gcc.dg/vect/pr65310.c -flto -ffat-lto-objects  scan-tree-dump-not vect 
> "misalign = 0"
> FAIL: gcc.dg/vect/pr65310.c scan-tree-dump vect "can't force alignment"
> FAIL: gcc.dg/vect/pr65310.c scan-tree-dump-not vect "misalign = 0"
> FAIL: gcc.dg/vect/pr65518.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "vectorized 0 loops in function" 2
> FAIL: gcc.dg/vect/pr65518.c scan-tree-dump-times vect "vectorized 0 loops in 
> function" 2
> FAIL: gcc.dg/vect/pr68445.c -flto -ffat-lto-objects  scan-tree-dump vect 
> "vectorizing stmts using SLP"
> FAIL: gcc.dg/vect/pr68445.c scan-tree-dump vect "vectorizing stmts using SLP"
> FAIL: gcc.dg/vect/pr88598-1.c -flto -ffat-lto-objects  scan-tree-dump-not 
> optimized "REDUC_PLUS"
> FAIL: gcc.dg/vect/pr88598-1.c scan-tree-dump-not optimized "REDUC_PLUS"
> FAIL: gcc.dg/vect/pr88598-2.c -flto -ffat-lto-objects  scan-tree-dump-not 
> optimized "REDUC_PLUS"
> FAIL: gcc.dg/vect/pr88598-2.c scan-tree-dump-not optimized "REDUC_PLUS"
> FAIL: gcc.dg/vect/pr88598-3.c -flto -ffat-lto-objects  scan-tree-dump-not 
> optimized "REDUC_PLUS"
> FAIL: gcc.dg/vect/pr88598-3.c scan-tree-dump-not optimized "REDUC_PLUS"
> FAIL: gcc.dg/vect/pr94994.c -flto -ffat-lto-objects execution test
> FAIL: gcc.dg/vect/pr94994.c execution test
> FAIL: gcc.dg/vect/pr97835.c -flto -ffat-lto-objects  scan-tree-dump vect 
> "vectorizing stmts using SLP"
> FAIL: gcc.dg/vect/pr97835.c scan-tree-dump vect "vectorizing stmts using SLP"
> FAIL: gcc.dg/vect/slp-1.c -flto -ffat-lto-objects  scan-tree-dump-times vect 
> "vectorizing stmts using SLP" 4
> FAIL: gcc.dg/vect/slp-1.c scan-tree-dump-times vect "vectorizing stmts using 
> SLP" 4
> FAIL: gcc.dg/vect/slp-11a.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "vectorized 0 loops" 1
> FAIL: gcc.dg/vect/slp-11a.c scan-tree-dump-times vect "vectorized 0 loops" 1
> FAIL: gcc.dg/vect/slp-12a.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "vectorized 0 loops" 1
> FAIL: gcc.dg/vect/slp-12a.c scan-tree-dump-times vect "vectorized 0 loops" 1
> FAIL: gcc.dg/vect/slp-12c.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "vectorized 0 loops" 1
> FAIL: gcc.dg/vect/slp-12c.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "vectorizing stmts using SLP" 0
> FAIL: gcc.dg/vect/slp-12c.c scan-tree-dump-times vect "vectorized 0 loops" 1
> FAIL: gcc.dg/vect/slp-12c.c scan-tree-dump-times vect "vectorizing stmts 
> using SLP" 0
> FAIL: gcc.dg/vect/slp-15.c -flto -ffat-lto-objects  scan-tree-dump-times vect 
> "vectorized 0 loops" 1
> FAIL: gcc.dg/vect/slp-15.c -flto -ffat-lto-objects  scan-tree-dump-times vect 
> "vectorizing stmts using SLP" 0
> FAIL: gcc.dg/vect/slp-15.c scan-tree-dump-times vect "vectorized 0 loops" 1
> FAIL: gcc.dg/vect/slp-15.c 

Re: [PATCH] RISC-V: Disable user vsetvl fusion into EMPTY block

2023-08-28 Thread Robin Dapp via Gcc-patches
> || vsetvl_insn_p (expr.get_insn ()->rtl ()))
>   continue;
> new_info = expr.global_merge (expr, eg->src->index);
> @@ -3317,6 +3335,25 @@ pass_vsetvl::earliest_fusion (void)
> prob = profile_probability::uninitialized ();
>   }
> else if (!src_block_info.reaching_out.compatible_p (expr)
> +/* We don't do fusion across BBs for user explicit
> +   vsetvl instruction for these following reasons:
> +
> +- The user vsetvl instruction is configured as
> +  no side effects that the previous passes
> +  (GSCE, Loop-invariant, ..., etc)
> +  should be able to do a good job on optimization
> +  of user explicit vsetvls so we don't need to
> +  PRE optimization (The user vsetvls should be
> +  on the optimal local already before this pass)
> +  again for user vsetvls in VSETVL PASS here
> +  (Phase 3 && Phase 4).
> +
> +- Allowing user vsetvls be optimized in PRE
> +  optimization here (Phase 3 && Phase 4) will
> +  complicate the codes so much so we prefer user
> +  vsetvls be optimized in post-optimization
> +  (Phase 5 && Phase 6).  */
> +&& !vsetvl_insn_p (expr.get_insn ()->rtl ())
>  && dest_block_info.probability
>   > src_block_info.probability)

This is OK but do we need the same comment twice?  The first one doesn't
seem to refer to a change but also to vsetvl_insn_p (expr.get_insn()...).

And where is the !empty block property enforced?  I would prefer to have
the longer comment at a top level and shortly describe here why
!vsetvl_insn_p ensures we don't have an EMPTY block.  Don't we usually
have a state for this (i.e. empty_p)?

No need for a V2 though, it can also stay as is and be cleaned up later.

Regards
 Robin


Re: [PATCH V2] RISC-V: Enable vec_int testsuite for RVV VLA vectorization

2023-08-28 Thread Robin Dapp via Gcc-patches
Thanks,

just giving my quick thoughts on some of the FAILs:

> Test report:
> FAIL: gcc.dg/vect/bb-slp-10.c -flto -ffat-lto-objects  scan-tree-dump slp2 
> "unsupported unaligned access"
> FAIL: gcc.dg/vect/bb-slp-10.c scan-tree-dump slp2 "unsupported unaligned 
> access"

For these we would need to add riscv to target_vect_element_align_preferred.
That might depend on uarch, though. 

> FAIL: gcc.dg/vect/bb-slp-70.c (test for excess errors)
> FAIL: gcc.dg/vect/bb-slp-70.c -flto -ffat-lto-objects (test for excess errors)
> FAIL: gcc.dg/vect/bb-slp-layout-17.c (test for excess errors)
> FAIL: gcc.dg/vect/bb-slp-layout-17.c -flto -ffat-lto-objects (test for excess 
> errors)

For these we need -Wno-psabi for now.   Besides, I still wanted to provide a
popcount fallback sometime soon.

> FAIL: gcc.dg/vect/pr65310.c -flto -ffat-lto-objects  scan-tree-dump vect 
> "can't force alignment"
> FAIL: gcc.dg/vect/pr65310.c -flto -ffat-lto-objects  scan-tree-dump-not vect 
> "misalign = 0"
> FAIL: gcc.dg/vect/pr65310.c scan-tree-dump vect "can't force alignment"
> FAIL: gcc.dg/vect/pr65310.c scan-tree-dump-not vect "misalign = 0"

Same as above with vect_element_align_preferred.

> XPASS: gcc.dg/vect/vect-10.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "vectorized 1 loops" 1
> XPASS: gcc.dg/vect/vect-10.c scan-tree-dump-times vect "vectorized 1 loops" 1
> FAIL: gcc.dg/vect/vect-104.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "possible dependence between data-refs" 1
> FAIL: gcc.dg/vect/vect-104.c scan-tree-dump-times vect "possible dependence 
> between data-refs" 1
> FAIL: gcc.dg/vect/vect-109.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "unsupported unaligned access" 2
> FAIL: gcc.dg/vect/vect-109.c scan-tree-dump-times vect "unsupported unaligned 
> access" 2
> XPASS: gcc.dg/vect/vect-24.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "vectorized 3 loops" 1
> XPASS: gcc.dg/vect/vect-24.c scan-tree-dump-times vect "vectorized 3 loops" 1
> FAIL: gcc.dg/vect/vect-26.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "Alignment of access forced using peeling" 1
> FAIL: gcc.dg/vect/vect-26.c scan-tree-dump-times vect "Alignment of access 
> forced using peeling" 1
> FAIL: gcc.dg/vect/vect-27.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "Vectorizing an unaligned access" 1
> FAIL: gcc.dg/vect/vect-27.c scan-tree-dump-times vect "Vectorizing an 
> unaligned access" 1
> FAIL: gcc.dg/vect/vect-29.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "Vectorizing an unaligned access" 1
> FAIL: gcc.dg/vect/vect-29.c scan-tree-dump-times vect "Vectorizing an 
> unaligned access" 1
> FAIL: gcc.dg/vect/vect-33.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "Alignment of access forced using versioning" 1
> FAIL: gcc.dg/vect/vect-33.c scan-tree-dump-times vect "Alignment of access 
> forced using versioning" 1
> FAIL: gcc.dg/vect/vect-72.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "Vectorizing an unaligned access" 1
> FAIL: gcc.dg/vect/vect-72.c scan-tree-dump-times vect "Vectorizing an 
> unaligned access" 1
> FAIL: gcc.dg/vect/vect-75-big-array.c -flto -ffat-lto-objects  
> scan-tree-dump-times vect "Vectorizing an unaligned access" 1
> FAIL: gcc.dg/vect/vect-75-big-array.c scan-tree-dump-times vect "Vectorizing 
> an unaligned access" 1
> FAIL: gcc.dg/vect/vect-75.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "Vectorizing an unaligned access" 1
> FAIL: gcc.dg/vect/vect-75.c scan-tree-dump-times vect "Vectorizing an 
> unaligned access" 1
> FAIL: gcc.dg/vect/vect-77-alignchecks.c -flto -ffat-lto-objects  
> scan-tree-dump-times vect "Vectorizing an unaligned access" 1
> FAIL: gcc.dg/vect/vect-77-alignchecks.c scan-tree-dump-times vect 
> "Vectorizing an unaligned access" 1
> FAIL: gcc.dg/vect/vect-77-global.c -flto -ffat-lto-objects  
> scan-tree-dump-times vect "Vectorizing an unaligned access" 1
> FAIL: gcc.dg/vect/vect-77-global.c scan-tree-dump-times vect "Vectorizing an 
> unaligned access" 1
> FAIL: gcc.dg/vect/vect-78-alignchecks.c -flto -ffat-lto-objects  
> scan-tree-dump-times vect "Vectorizing an unaligned access" 1
> FAIL: gcc.dg/vect/vect-78-alignchecks.c scan-tree-dump-times vect 
> "Vectorizing an unaligned access" 1
> FAIL: gcc.dg/vect/vect-78-global.c -flto -ffat-lto-objects  
> scan-tree-dump-times vect "Vectorizing an unaligned access" 1
> FAIL: gcc.dg/vect/vect-78-global.c scan-tree-dump-times vect "Vectorizing an 
> unaligned access" 1
> FAIL: gcc.dg/vect/vect-89-big-array.c -flto -ffat-lto-objects  
> scan-tree-dump-times vect "Alignment of access forced using peeling" 1
> FAIL: gcc.dg/vect/vect-89-big-array.c scan-tree-dump-times vect "Alignment of 
> access forced using peeling" 1
> FAIL: gcc.dg/vect/vect-89.c -flto -ffat-lto-objects  scan-tree-dump-times 
> vect "Alignment of access forced using peeling" 1
> FAIL: gcc.dg/vect/vect-89.c scan-tree-dump-times vect "Alignment of access 
> forced using peeling" 1
> 

Re: [PATCH V2] RISC-V: Add conditional autovec convert(INT<->INT) patterns

2023-08-25 Thread Robin Dapp via Gcc-patches
Hi Lehua,

thanks, LGTM.

One thing maybe for the next patches:  It seems to me that we lump all of
the COND_... tests into the cond subdirectory when IMHO they would also
fit into the respective directories of their operations (binop, unop etc).
Right now we will have a lot of rather unrelated tests (or just related
by their use of COND_) in one dir.  What do you think?  

Regards
 Robin


Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.

2023-08-24 Thread Robin Dapp via Gcc-patches
Ping.  I refined the code and some comments a bit and added a test
case.

My question in general would still be:  Is this something we want
given that we potentially move some of combine's work a bit towards
the front of the RTL pipeline?

Regards
 Robin

Subject: [PATCH] fwprop: Allow UNARY_P and check register pressure.

This patch enables the forwarding of UNARY_P sources.  As this
involves potentially replacing a vector register with a scalar register
the ira_hoist_pressure machinery is used to calculate the change in
register pressure.  If the propagation would increase the pressure
beyond the number of hard regs, we don't perform it.
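
The cross-block part of that check can be pictured with a simplified model:
walk the predecessors of the use's block back towards the def's block and
refuse the propagation if any block on the way already runs at the hard-reg
limit (basic_block and the regpressure bookkeeping are reduced to plain data
here; this is a sketch, not the patch's exact logic):

```cpp
#include <cassert>
#include <vector>

// Simplified model: basic blocks reduced to a predecessor list and the
// current pressure in the replacement register's class.
struct block
{
  std::vector<int> preds;
  int pressure;
};

// Walk predecessors from the use's block back to the def's block; reject the
// propagation (return true) if any visited block already runs at the
// hard-reg limit.
bool
pressure_too_high (const std::vector<block> &cfg, int use_bb, int def_bb,
		   int n_hard_regs)
{
  std::vector<bool> visited (cfg.size (), false);
  std::vector<int> worklist = {use_bb};
  while (!worklist.empty ())
    {
      int bb = worklist.back ();
      worklist.pop_back ();
      if (visited[bb])
	continue;
      visited[bb] = true;
      if (cfg[bb].pressure >= n_hard_regs)
	return true;
      if (bb == def_bb)
	continue;
      for (int p : cfg[bb].preds)
	worklist.push_back (p);
    }
  return false;
}
```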

gcc/ChangeLog:

* fwprop.cc (fwprop_propagation::profitable_p): Add unary
handling.
(fwprop_propagation::update_register_pressure): New function.
(fwprop_propagation::register_pressure_high_p): New function
(reg_single_def_for_src_p): Look through unary expressions.
(try_fwprop_subst_pattern): Check register pressure.
(forward_propagate_into): Call new function.
(fwprop_init): Init register pressure.
(fwprop_done): Clean up register pressure.
(fwprop_insn): Add comment.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c: New test.
---
 gcc/fwprop.cc | 314 +-
 .../riscv/rvv/autovec/binop/vadd-vx-fwprop.c  |  64 
 2 files changed, 371 insertions(+), 7 deletions(-)
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c

diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc
index 0707a234726..b49d4e4ced4 100644
--- a/gcc/fwprop.cc
+++ b/gcc/fwprop.cc
@@ -36,6 +36,10 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-pass.h"
 #include "rtl-iter.h"
 #include "target.h"
+#include "dominance.h"
+
+#include "ira.h"
+#include "regpressure.h"
 
 /* This pass does simple forward propagation and simplification when an
operand of an insn can only come from a single def.  This pass uses
@@ -103,6 +107,10 @@ using namespace rtl_ssa;
 
 static int num_changes;
 
+/* Keep track of which registers already increased the pressure to avoid double
+   booking.  */
+sbitmap pressure_accounted;
+
 /* Do not try to replace constant addresses or addresses of local and
argument slots.  These MEM expressions are made only once and inserted
in many instructions, as well as being used to control symbol table
@@ -181,6 +189,8 @@ namespace
 bool changed_mem_p () const { return result_flags & CHANGED_MEM; }
 bool folded_to_constants_p () const;
 bool profitable_p () const;
+bool register_pressure_high_p (rtx, rtx, rtx_insn *, rtx_insn *) const;
+bool update_register_pressure (rtx, rtx, rtx_insn *, rtx_insn *) const;
 
 bool check_mem (int, rtx) final override;
 void note_simplification (int, uint16_t, rtx, rtx) final override;
@@ -332,25 +342,247 @@ fwprop_propagation::profitable_p () const
   && (result_flags & PROFITABLE))
 return true;
 
-  if (REG_P (to))
+  /* Only continue with a unary operation if we consider register
+     pressure.  */
+  rtx what = copy_rtx (to);
+  if (UNARY_P (what) && flag_ira_hoist_pressure)
+what = XEXP (what, 0);
+
+  if (REG_P (what))
 return true;
 
-  if (GET_CODE (to) == SUBREG
-  && REG_P (SUBREG_REG (to))
-  && !paradoxical_subreg_p (to))
+  if (GET_CODE (what) == SUBREG
+  && REG_P (SUBREG_REG (what))
+  && !paradoxical_subreg_p (what))
 return true;
 
-  if (CONSTANT_P (to))
+  if (CONSTANT_P (what))
 return true;
 
   return false;
 }
 
-/* Check that X has a single def.  */
+/* Check if the register pressure in any predecessor block of USE's block
+   up to DEF's block is equal to or higher than the number of hardregs in
+   NU's register class.  */
+bool
+fwprop_propagation::register_pressure_high_p (rtx nu, rtx old, rtx_insn *def,
+ rtx_insn *use) const
+{
+  enum reg_class nu_class, old_class;
+  int nu_nregs, old_nregs;
+  nu_class = regpressure_get_regno_pressure_class (REGNO (nu), &nu_nregs);
+  old_class
+    = regpressure_get_regno_pressure_class (REGNO (old), &old_nregs);
+
+  if (nu_class == NO_REGS && old_class == NO_REGS)
+return true;
+
+  if (nu_class == old_class)
+return false;
+
+  basic_block bbfrom = BLOCK_FOR_INSN (def);
+  basic_block bbto = BLOCK_FOR_INSN (use);
+
+  basic_block bb;
+
+  sbitmap visited = sbitmap_alloc (last_basic_block_for_fn (cfun));
+  bitmap_clear (visited);
+  auto_vec<basic_block> q;
+  q.safe_push (bbto);
+
+  while (!q.is_empty ())
+{
+  bb = q.pop ();
+
+  if (bitmap_bit_p (visited, bb->index))
+   continue;
+
+  /* Nothing to do if the register to be replaced is not live
+in this BB.  */
+  if (bb != bbfrom && !regpressure_is_live_in (bb, REGNO (old)))
+   continue;
+
+  /* Nothing to do if the replacement register is already live in
+this BB.  */
+ 

Re: [PATCH] tree-optimization/111115 - SLP of masked stores

2023-08-24 Thread Robin Dapp via Gcc-patches
This causes an ICE in
gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-11.c
(internal compiler error: in get_group_load_store_type, at 
tree-vect-stmts.cc:2121)

#include <stdint.h>

#define TEST_LOOP(DATA_TYPE, INDEX_TYPE)   \
  void __attribute__ ((noinline, noclone)) \
  f_##DATA_TYPE##_##INDEX_TYPE (DATA_TYPE *restrict y, DATA_TYPE *restrict x,  \
INDEX_TYPE *restrict index,\
INDEX_TYPE *restrict cond) \
  {\
for (int i = 0; i < 100; ++i)  \
  {\
if (cond[i * 2])   \
  y[i * 2] = x[index[i * 2]] + 1;  \
if (cond[i * 2 + 1])   \
  y[i * 2 + 1] = x[index[i * 2 + 1]] + 2;  \
  }\
  }

TEST_LOOP (int8_t, int8_t)

Is there now a mismatch with the LEN_ IFNs somewhere?

Regards
 Robin


Re: [PATCH] RISC-V: Add COND_LEN_FNMA/COND_LEN_FMS/COND_LEN_FNMS testcases

2023-08-24 Thread Robin Dapp via Gcc-patches
OK.

Regards
 Robin



Re: [PATCH V2] RISC-V: Support LEN_FOLD_EXTRACT_LAST auto-vectorization

2023-08-24 Thread Robin Dapp via Gcc-patches
LGTM.

Regards
 Robin


Re: [PATCH] RISC-V: Add conditional sign/zero extension and truncation autovec patterns

2023-08-24 Thread Robin Dapp via Gcc-patches


> Yes, it's better to call it one_quad.

I'd suggest to go with quarter as before or quarter_width_op
or something.

>> Is this necessary for recognizing a different pattern?
> 
> Are you saying that the testcases xxx-1 and xxx-2 are duplicated? If
> so, I have no problem removing it and just keeping xxx-1 testcase
> since it is still possible to cover my code.

I was just curious why the NEW_TYPE bi = b[i] was necessary instead
of using b[i] directly.

Regards
 Robin



Re: [PATCH] RISC-V: Add conditional sign/zero extension and truncation autovec patterns

2023-08-24 Thread Robin Dapp via Gcc-patches
Hi Lehua,

thanks, just tiny non-functional nits.

> -  rtx ops[] = {operands[0], quarter};
> -  icode = code_for_pred_trunc (mode);
> -  riscv_vector::emit_vlmax_insn (icode, riscv_vector::RVV_UNOP, ops);
> +  rtx half = gen_reg_rtx (mode);

Not really a half anymore now? :)

> +#include <stdint.h>
> +
> +#define DEF_LOOP(OLD_TYPE, NEW_TYPE)                                 \
> +  void __attribute__ ((noipa))                                       \
> +  test_##OLD_TYPE##_2_##NEW_TYPE (NEW_TYPE *__restrict r,            \
> +                                  OLD_TYPE *__restrict a,            \
> +                                  NEW_TYPE *__restrict b,            \
> +                                  OLD_TYPE *__restrict pred, int n)  \
> +  {                                                                  \
> +    for (int i = 0; i < n; ++i)                                      \
> +      {                                                              \
> +        NEW_TYPE bi = b[i];                                          \

Is this necessary for recognizing a different pattern?

> +/* wider-width Integer Type => Integer Type */

Isn't it the other way around or am I just confused?

> +/* narrower-width Integer Type => Integer Type */
> +#define TEST_ALL_X2X_NARROWER(T)                                     \
> +  T (uint16_t, uint8_t)                                              \

Same here.

Regards
 Robin



Re: [PATCH] RISC-V: Support LEN_FOLD_EXTRACT_LAST auto-vectorization

2023-08-24 Thread Robin Dapp via Gcc-patches
Hi Juzhe,

>   vcpop.m a5,v0
>   beq a5,zero,.L3
>   addia5,a5,-1
>   vsetvli a4,zero,e32,m1,ta,ma
>   vcompress.vmv2,v3,v0
>   vslidedown.vx   v2,v2,a5
>   vmv.x.s a0,v2
> .L3:
>   sext.w  a0,a0

Mhm, where is this sext coming from?  Thought I had this covered with
the autovec-opt pattern but apparently not.  I'll take that, nothing
related to this patch.

> --- a/gcc/config/riscv/riscv-v.cc
> +++ b/gcc/config/riscv/riscv-v.cc
> @@ -213,7 +213,7 @@ public:
> {
>   /* Optimize VLS-VLMAX code gen, we can use vsetivli instead of
>  the vsetvli to obtain the value of vlmax.  */
> - poly_uint64 nunits = GET_MODE_NUNITS (m_dest_mode);
> + poly_uint64 nunits = GET_MODE_NUNITS (m_mask_mode);

Why is that necessary?  Just for the popcount I presume?
Can't we rather have a new case for a scalar destination?  I find
the code a bit misleading now as we check m_dest_mode and then not
use it.

>  
> +/* Emit vcpop.m instruction.  */
> +
> +static void
> +emit_cpop_insn (unsigned icode, rtx *ops, rtx len)
> +{
> +  machine_mode dest_mode = GET_MODE (ops[0]);
> +  machine_mode mask_mode = GET_MODE (ops[1]);
> +  insn_expander e (RVV_CPOP,
> +   /* HAS_DEST_P */ true,
> +   /* FULLY_UNMASKED_P */ true,
> +   /* USE_REAL_MERGE_P */ true,
> +   /* HAS_AVL_P */ true,
> +   /* VLMAX_P */ len ? false : true,
> +   dest_mode, mask_mode);
> +
> +  e.set_vl (len);
> +  e.emit_insn ((enum insn_code) icode, ops);
> +}

The use_real_merge just appeared odd to me here because there is
nothing to merge.  But in the end it's just to omit the vundef operand
so good for now.  There is an increasing number of opportunities to
refactor in riscv-v.cc, though ;)

The rest looks good to me.  Note that my machine crashed when
compiling the extract_last-14.c because it used up all my RAM.
The vsetvl "refactor" phase 3 patch helped, though.
We'd need to have this patch depend on the other one then.

Otherwise this looks good to me.  At first I was a bit wary about the
branching zero check after popcount but as we're outside of a loop
anyway, that's fine.  Might want to use a conditional select in the
future but actually not that important. 

Regards
 Robin


Re: [PATCH] RISC-V: Add initial pipeline description for an out-of-order core.

2023-08-23 Thread Robin Dapp via Gcc-patches
> Does this patch fix these 2 following PR:
> 108271 – Missed RVV cost model (gnu.org) 
> 
> 108412 – RISC-V: Negative optimization of GCSE && LOOP INVARIANTS (gnu.org) 
> 
> 
> If yes, plz append these 2 cases into testsuite and indicate those 2 PR are 
> fixed.
> So that we can close them.

The second one is fixed on my local branch, the first not yet because there
is more to it still.  The second one is mostly due to pressure-aware scheduling
and I'm going to add the test case as well as the PR reference to the commit
once this is verified.

Regards
 Robin


[PATCH] RISC-V: Add initial pipeline description for an out-of-order core.

2023-08-23 Thread Robin Dapp via Gcc-patches
Hi,

this adds a pipeline description for a generic out-of-order core.
Latency and units are not based on any real processor but more or less
educated guesses what such a processor could look like.
For the lack of a better name, I called the -mtune parameter "generic-ooo".

In order to account for latency scaling by LMUL != 1, sched_adjust_cost
is implemented.  It will scale an instruction's latency by its LMUL
so an LMUL == 8 instruction will take 8 times the number of cycles
the same instruction with LMUL == 1 would take.
As this potentially causes very high latencies which, in turn, might
lead to scheduling anomalies and a higher number of vsetvls emitted,
this feature is only enabled when specifying -madjust-lmul-cost.
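
The scaling itself is simple; as a sketch (the derivation of LMUL from mode
size and VLEN is simplified here and fractional LMUL is clamped to 1 - the
real hook works on machine modes, not raw bit counts):

```cpp
#include <cassert>

// Illustration only: LMUL of a mode approximated as how many VLEN-bit
// vector registers it occupies.
int
mode_lmul (int mode_bits, int vlen)
{
  int lmul = mode_bits / vlen;
  return lmul < 1 ? 1 : lmul;
}

// Base latency scaled by LMUL, mirroring the idea behind
// -madjust-lmul-cost: an LMUL == 8 instruction takes 8 times as long as
// the same instruction at LMUL == 1.
int
adjust_lmul_cost (int base_latency, int mode_bits, int vlen)
{
  return base_latency * mode_lmul (mode_bits, vlen);
}
```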

Additionally, in order to easily recognize pre-RA vsetvls this patch
introduces an insn type vsetvl_pre which is used in sched_adjust_cost.

As mentioned, the latency numbers are guesswork at best.  I assumed
6-wide issue as most public announcements point into that direction
and obviously everything else is similarly coarse.  Feel free to
correct in case I unnecessarily pessimized or underestimated something.

Regards
 Robin

gcc/ChangeLog:

* config/riscv/riscv-cores.def (RISCV_TUNE): Add parameter.
* config/riscv/riscv-opts.h (enum riscv_microarchitecture_type):
Add generic_ooo.
* config/riscv/riscv.cc (riscv_sched_adjust_cost): Implement
scheduler hook.
(TARGET_SCHED_ADJUST_COST): Define.
* config/riscv/riscv.md (no,yes"): Include generic-ooo.md
* config/riscv/riscv.opt: Add -madjust-lmul-cost.
* config/riscv/generic-ooo.md: New file.
* config/riscv/vector.md: Add vsetvl_pre.
---
 gcc/config/riscv/generic-ooo.md  | 284 +++
 gcc/config/riscv/riscv-cores.def |   1 +
 gcc/config/riscv/riscv-opts.h|   3 +-
 gcc/config/riscv/riscv.cc|  87 ++
 gcc/config/riscv/riscv.md|   5 +-
 gcc/config/riscv/riscv.opt   |   3 +
 gcc/config/riscv/vector.md   |   4 +-
 7 files changed, 383 insertions(+), 4 deletions(-)
 create mode 100644 gcc/config/riscv/generic-ooo.md

diff --git a/gcc/config/riscv/generic-ooo.md b/gcc/config/riscv/generic-ooo.md
new file mode 100644
index 000..78b9e48f935
--- /dev/null
+++ b/gcc/config/riscv/generic-ooo.md
@@ -0,0 +1,284 @@
+;; RISC-V generic out-of-order core scheduling model.
+;; Copyright (C) 2017-2023 Free Software Foundation, Inc.
+;;
+;; This file is part of GCC.
+;;
+;; GCC is free software; you can redistribute it and/or modify it
+;; under the terms of the GNU General Public License as published by
+;; the Free Software Foundation; either version 3, or (at your option)
+;; any later version.
+;;
+;; GCC is distributed in the hope that it will be useful, but
+;; WITHOUT ANY WARRANTY; without even the implied warranty of
+;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+;; General Public License for more details.
+;;
+;; You should have received a copy of the GNU General Public License
+;; along with GCC; see the file COPYING3.  If not see
+;; .
+
+(define_automaton "generic_ooo")
+
+;; Regarding functional units we assume a three-way split:
+;; - Integer ALU (IXU) - 4 symmetric units.
+;; - Floating-point (FXU) - 2 symmetric units.
+;; - Vector Unit (VXU) - 1 unit.
+
+;; We assume 6-wide issue:
+;; - 5-wide generic/integer issue.
+;; - 1-wide vector issue.
+
+;; For now, the only subunits are for non-pipelined integer division and
+;; vector div/mult/sqrt.
+;; No extra units for e.g. vector permutes, masking, everything is assumed to
+;; be on the same pipelined execution unit.
+
+;; Latency:
+;; - Regular integer operations take 1 cycle.
+;; - Multiplication/Division take multiple cycles.
+;; - Float operations take 4-6 cycles.
+;; - Regular vector operations take 2-6 cycles.
+;;   (This assumes LMUL = 1, latency for LMUL = 2, 4, 8 is scaled accordingly
+;;by riscv_sched_adjust_cost when -madjust-lmul-cost is given)
+;; - Load/Store:
+;;   - To/From IXU: 4 cycles.
+;;   - To/From FXU: 6 cycles.
+;;   - To/From VXU: 6 cycles.
+
+;; Integer/float issue queues.
+(define_cpu_unit "issue0,issue1,issue2,issue3,issue4" "generic_ooo")
+
+;; Separate issue queue for vector instructions.
+(define_cpu_unit "generic_ooo_vxu_issue" "generic_ooo")
+
+;; Integer/float execution units.
+(define_cpu_unit "ixu0,ixu1,ixu2,ixu3" "generic_ooo")
+(define_cpu_unit "fxu0,fxu1" "generic_ooo")
+
+;; Integer subunit for division.
+(define_cpu_unit "generic_ooo_div" "generic_ooo")
+
+;; Vector execution unit.
+(define_cpu_unit "generic_ooo_vxu_alu" "generic_ooo")
+
+;; Vector subunit that does mult/div/sqrt.
+(define_cpu_unit "generic_ooo_vxu_multicycle" "generic_ooo")
+
+;; Shortcuts
+(define_reservation "generic_ooo_issue" "issue0|issue1|issue2|issue3|issue4")
+(define_reservation "generic_ooo_ixu_alu" "ixu0|ixu1|ixu2|ixu3")
+(define_reservation "generic_ooo_fxu" "fxu0|fxu1")
+
+

Re: [PATCH V2] RISC-V: Add conditional unary neg/abs/not autovec patterns

2023-08-23 Thread Robin Dapp via Gcc-patches
OK, thanks.

Regards
 Robin


Re: [PATCH] RISC-V: Add conditional unary neg/abs/not autovec patterns

2023-08-22 Thread Robin Dapp via Gcc-patches
Hi Lehua,

no concerns here, just tiny remarks but in general LGTM as is.

> +(define_insn_and_split "*copysign_neg"
> +  [(set (match_operand:VF 0 "register_operand")
> +(neg:VF
> +  (unspec:VF [
> +(match_operand:VF 1 "register_operand")
> +(match_operand:VF 2 "register_operand")
> +  ] UNSPEC_VCOPYSIGN)))]
> +  "TARGET_VECTOR && can_create_pseudo_p ()"
> +  "#"
> +  "&& 1"
> +  [(const_int 0)]
> +{
> +  riscv_vector::emit_vlmax_insn (code_for_pred_ncopysign (mode),
> + riscv_vector::RVV_BINOP, operands);
> +  DONE;
> +})

It's a bit unfortunate that we need this now but well, no way around it.

> -  emit_insn (gen_vcond_mask (vmode, vmode, d->target, d->op0, d->op1, mask));
> +  /* swap op0 and op1 since the order is opposite to pred_merge.  */
> +  rtx ops2[] = {d->target, d->op1, d->op0, mask};
> +  emit_vlmax_merge_insn (code_for_pred_merge (vmode), 
> riscv_vector::RVV_MERGE_OP, ops2);
>return true;
>  }

This seems a separate, general fix that just surfaced in the course of
this patch?  Would be nice to have this factored out but as we already have
it, no need I guess.

> +  if (is_dummy_mask)
> +{
> +  /* Use TU, MASK ANY policy.  */
> +  if (needs_fp_rounding (code, mode))
> + emit_nonvlmax_fp_tu_insn (icode, RVV_UNOP_TU, cond_ops, len);
> +  else
> + emit_nonvlmax_tu_insn (icode, RVV_UNOP_TU, cond_ops, len);
> +}

We have quite a bit of code duplication across the expand_cond_len functions
now (binop, ternop, unop).  Not particular to your patch but I'd suggest to
unify this later. 

> +TEST_ALL (DEF_LOOP)
> +
> +/* NOTE: int abs operator is converted to vmslt + vneg.v */
> +/* { dg-final { scan-assembler-times {\tvneg\.v\tv[0-9]+,v[0-9]+,v0\.t} 12 { 
> xfail { any-opts "--param riscv-autovec-lmul=m2" } } } } */

Why does this fail with LMUL == 2 (also in the following tests)?  A comment
would be nice here.

Regards
 Robin



Re: [PATCH] RISC-V: Add conditional unary neg/abs/not autovec patterns

2023-08-22 Thread Robin Dapp via Gcc-patches
> What about conditional zero_extension, sign_extension,
> float_extension, ...etc?
> 
> We have discussed this, we can have some many conditional situations
> that can be supported by either match.pd or rtl backend combine
> pass.
> 
> IMHO, it will be too many optabs/internal fns if we support all of
> them in match.pd? Feel free to correct me I am wrong.
I think the general trend is (and should be) to push things forward
in the pipeline and not just have combine fix it.  However, for now
this would complicate things and therefore I agree with the approach
the patch takes.  I'd rather have the patterns in now rather than change
the middle end for unclear benefit.  

IMHO long-term we want things to be optimized early but short-term
combine is good enough.  We can then move optimizations forward on a
case-by-case basis.

Regards
 Robin


Re: RISCV test infrastructure for d / v / zfh extensions

2023-08-21 Thread Robin Dapp via Gcc-patches
Hi Joern.

> Hmm, you are right.  I personally prefer my version because it allows
> consistent naming of the
> different tests, also easily extendible when new extensions need testing.
> Although the riscv_vector name has the advantage that it is better
> legible for people who are
> not used to dealing with RISC_V extension names.  If we keep
> riscv_vector, it would make
> sense to name the other tests also something more verbose, e.g. change
> riscv_d into
> riscv_double_fp or even riscv_double_precision_floating_point .
> It would be nice to hear other people's opinions on the naming.

I can live with either, with a preference for your naming scheme, i.e.
calling the extensions directly by their name for consistency reasons.
A more verbose scheme might lead to misconceptions later in case we
have several closely related extensions.  There will probably already be
ample discussion during ratification about naming and IMHO we should
not repeat that just to make names more accessible.  If needed we can
still add comments in the respective tests to clarify.
Vector is usually special among architecture extensions but we're not
even consistent with naming in the source itself, so...  

>> Would it make sense to skip the first check here
>> (check_effective_target_riscv_v) so we have a proper runtime check?
> 
> My starting point was that the changing of global testsuite variables around -
> as the original RISC-V vector patches did - is wrong.  The user asked to test
> a particular target (or set targets, for multilibs), and that target
> is the one to test,
> so we can't just assume it has other hardware features that are not implied by
> the target.
> Contrarily, the target that the user requested to test can be assumed to be
> available for testing.  Testing that it actually works is a part of
> the point of the
> test.  If I ask for a dejagnu test for a target that has vector support, I 
> would
> hope that the vector support is also tested, not backing off if it finds that
> there is a problem with the target,
> The way I look at things, when the macro  __riscv_v is defined,
> the compiler asserts that it is compiling for a target that has vector 
> support,
> because it was instructed by configuration / options to emit code for that
> target.  Which we can take as evidence that dejagnu is run with options
> to select that target (either explicitly or by default due to the
> configuration of
> the compiler under test)

Yes, I largely agree with that.  Where I was coming from is that several other
effective target checks will not short circuit the check but always perform it
fully (i.e. interpreting the effective target as the full chain up to 
execution).
Yet, I can see the appeal of the short circuit as well and in the end it really
doesn't matter all that much.

I would have preferred to replace the existing checks right away in order to
immediately have proper coverage but let's not dwell on that, therefore
LGTM, thanks. 

Regards
 Robin


Re: [PATCH] RISC-V: Refactor Phase 3 (Demand fusion) of VSETVL PASS

2023-08-21 Thread Robin Dapp via Gcc-patches
Hi Juzhe,

thanks, this is a reasonable approach and improves readability noticeably.
LGTM but I'd like to wait for other opinions (e.g. by Kito) as I haven't
looked closely into the vsetvl pass before and cannot entirely review it
quickly.  As we already have good test coverage there is not much that
can go wrong IMHO.

Regards
 Robin


[PATCH] RISC-V: Allow immediates 17-31 for vector shift.

2023-08-18 Thread Robin Dapp via Gcc-patches
Hi,

this patch adds a missing constraint check in order to be able to
print (and not ICE) vector immediates 17-31 for vector shifts.

Regards
 Robin

gcc/ChangeLog:

* config/riscv/riscv.cc (riscv_print_operand): Allow vk constraint.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/binop/shift-immediate.c: New test.
---
 gcc/config/riscv/riscv.cc|  3 ++-
 .../riscv/rvv/autovec/binop/shift-immediate.c| 16 
 2 files changed, 18 insertions(+), 1 deletion(-)
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/shift-immediate.c

diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index 49062bef9fc..0f60ffe5f60 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -4954,7 +4954,8 @@ riscv_print_operand (FILE *file, rtx op, int letter)
else if (satisfies_constraint_Wc0 (op))
  asm_fprintf (file, "0");
else if (satisfies_constraint_vi (op)
-|| satisfies_constraint_vj (op))
+|| satisfies_constraint_vj (op)
+|| satisfies_constraint_vk (op))
  asm_fprintf (file, "%wd", INTVAL (elt));
else
  output_operand_lossage ("invalid vector constant");
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/shift-immediate.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/shift-immediate.c
new file mode 100644
index 000..a2e1c33f4fa
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/shift-immediate.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-std=c99 -march=rv32gcv -mabi=ilp32d -O2 
--param=riscv-autovec-preference=scalable" } */
+
+#define uint8_t unsigned char
+
+void foo1 (uint8_t *a)
+{
+uint8_t b = a[0];
+int val = 0;
+
+for (int i = 0; i < 4; i++)
+{
+a[i] = (val & 1) ? (-val) >> 17 : val;
+val += b;
+}
+}
-- 
2.41.0



[PATCH] RISC-V/testsuite: Add missing conversion tests.

2023-08-18 Thread Robin Dapp via Gcc-patches
Hi,

this patch adds some missing tests for vf[nw]cvt.

Regards
 Robin

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/conversions/vfncvt-ftoi-run.c:
Add tests.
* gcc.target/riscv/rvv/autovec/conversions/vfncvt-ftoi-rv32gcv.c:
Ditto.
* gcc.target/riscv/rvv/autovec/conversions/vfncvt-ftoi-rv64gcv.c:
Ditto.
* gcc.target/riscv/rvv/autovec/conversions/vfncvt-ftoi-template.h:
Ditto.
* gcc.target/riscv/rvv/autovec/conversions/vfncvt-itof-rv32gcv.c:
Ditto.
* gcc.target/riscv/rvv/autovec/conversions/vfncvt-itof-rv64gcv.c:
Ditto.
* gcc.target/riscv/rvv/autovec/conversions/vfncvt-itof-template.h:
Ditto.
* gcc.target/riscv/rvv/autovec/conversions/vfncvt-itof-zvfh-run.c:
Ditto.
* gcc.target/riscv/rvv/autovec/conversions/vfwcvt-ftoi-rv32gcv.c:
Ditto.
* gcc.target/riscv/rvv/autovec/conversions/vfwcvt-ftoi-rv64gcv.c:
Ditto.
* gcc.target/riscv/rvv/autovec/conversions/vfwcvt-ftoi-template.h:
Ditto.
* gcc.target/riscv/rvv/autovec/conversions/vfwcvt-ftoi-zvfh-run.c:
Ditto.
* gcc.target/riscv/rvv/autovec/conversions/vfwcvt-itof-run.c:
Ditto.
* gcc.target/riscv/rvv/autovec/conversions/vfwcvt-itof-rv32gcv.c:
Ditto.
* gcc.target/riscv/rvv/autovec/conversions/vfwcvt-itof-rv64gcv.c:
Ditto.
* gcc.target/riscv/rvv/autovec/conversions/vfwcvt-itof-template.h:
Ditto.
* gcc.target/riscv/rvv/autovec/conversions/vfwcvt-itof-zvfh-run.c:
Ditto.
---
 .../rvv/autovec/conversions/vfncvt-ftoi-run.c | 96 +++
 .../autovec/conversions/vfncvt-ftoi-rv32gcv.c |  6 +-
 .../autovec/conversions/vfncvt-ftoi-rv64gcv.c |  6 +-
 .../conversions/vfncvt-ftoi-template.h|  6 ++
 .../autovec/conversions/vfncvt-itof-rv32gcv.c |  1 +
 .../autovec/conversions/vfncvt-itof-rv64gcv.c |  4 +-
 .../conversions/vfncvt-itof-template.h|  5 +-
 .../conversions/vfncvt-itof-zvfh-run.c| 32 +++
 .../autovec/conversions/vfwcvt-ftoi-rv32gcv.c |  4 +-
 .../autovec/conversions/vfwcvt-ftoi-rv64gcv.c |  4 +-
 .../conversions/vfwcvt-ftoi-template.h|  2 +
 .../conversions/vfwcvt-ftoi-zvfh-run.c| 32 +++
 .../rvv/autovec/conversions/vfwcvt-itof-run.c | 96 +++
 .../autovec/conversions/vfwcvt-itof-rv32gcv.c |  4 +-
 .../autovec/conversions/vfwcvt-itof-rv64gcv.c |  4 +-
 .../conversions/vfwcvt-itof-template.h| 10 +-
 .../conversions/vfwcvt-itof-zvfh-run.c| 10 +-
 17 files changed, 302 insertions(+), 20 deletions(-)

diff --git 
a/gcc/testsuite/gcc.target/riscv/rvv/autovec/conversions/vfncvt-ftoi-run.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/conversions/vfncvt-ftoi-run.c
index ce3fcfa9af8..73eda067ba3 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/conversions/vfncvt-ftoi-run.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/conversions/vfncvt-ftoi-run.c
@@ -62,6 +62,38 @@ main ()
   RUN2 (float, uint16_t, 4096)
   RUN2 (float, uint16_t, 5975)
 
+  RUN (float, int8_t, 3)
+  RUN (float, int8_t, 4)
+  RUN (float, int8_t, 7)
+  RUN (float, int8_t, 99)
+  RUN (float, int8_t, 119)
+  RUN (float, int8_t, 128)
+  RUN (float, int8_t, 256)
+  RUN (float, int8_t, 279)
+  RUN (float, int8_t, 555)
+  RUN (float, int8_t, 1024)
+  RUN (float, int8_t, 1389)
+  RUN (float, int8_t, 2048)
+  RUN (float, int8_t, 3989)
+  RUN (float, int8_t, 4096)
+  RUN (float, int8_t, 5975)
+
+  RUN2 (float, uint8_t, 3)
+  RUN2 (float, uint8_t, 4)
+  RUN2 (float, uint8_t, 7)
+  RUN2 (float, uint8_t, 99)
+  RUN2 (float, uint8_t, 119)
+  RUN2 (float, uint8_t, 128)
+  RUN2 (float, uint8_t, 256)
+  RUN2 (float, uint8_t, 279)
+  RUN2 (float, uint8_t, 555)
+  RUN2 (float, uint8_t, 1024)
+  RUN2 (float, uint8_t, 1389)
+  RUN2 (float, uint8_t, 2048)
+  RUN2 (float, uint8_t, 3989)
+  RUN2 (float, uint8_t, 4096)
+  RUN2 (float, uint8_t, 5975)
+
   RUN (double, int32_t, 3)
   RUN (double, int32_t, 4)
   RUN (double, int32_t, 7)
@@ -93,4 +125,68 @@ main ()
   RUN2 (double, uint32_t, 3989)
   RUN2 (double, uint32_t, 4096)
   RUN2 (double, uint32_t, 5975)
+
+  RUN (double, int16_t, 3)
+  RUN (double, int16_t, 4)
+  RUN (double, int16_t, 7)
+  RUN (double, int16_t, 99)
+  RUN (double, int16_t, 119)
+  RUN (double, int16_t, 128)
+  RUN (double, int16_t, 256)
+  RUN (double, int16_t, 279)
+  RUN (double, int16_t, 555)
+  RUN (double, int16_t, 1024)
+  RUN (double, int16_t, 1389)
+  RUN (double, int16_t, 2048)
+  RUN (double, int16_t, 3989)
+  RUN (double, int16_t, 4096)
+  RUN (double, int16_t, 5975)
+
+  RUN2 (double, uint16_t, 3)
+  RUN2 (double, uint16_t, 4)
+  RUN2 (double, uint16_t, 7)
+  RUN2 (double, uint16_t, 99)
+  RUN2 (double, uint16_t, 119)
+  RUN2 (double, uint16_t, 128)
+  RUN2 (double, uint16_t, 256)
+  RUN2 (double, uint16_t, 279)
+  RUN2 (double, uint16_t, 555)
+  RUN2 (double, uint16_t, 1024)
+  RUN2 (double, uint16_t, 1389)
+  RUN2 

[PATCH] RISC-V: Enable pressure-aware scheduling by default.

2023-08-18 Thread Robin Dapp via Gcc-patches
Hi,

this patch enables pressure-aware scheduling for riscv.  There have been
various requests for it so I figured I'd just go ahead and send
the patch.

There is some slight regression in code quality for a number of
vector tests where we spill more due to a different instruction order.
The ones I looked at were a mix of bad luck and/or brittle tests.
Comparing the size of the generated assembly or the number of vsetvls
for SPECint also didn't show any immediate benefit but that's obviously
not a very fine-grained analysis.

As cost and scheduling models mature I expect the situation to improve
and for now I think it's generally favorable to enable pressure-aware
scheduling so we can work with it rather than trying to find every
possible problem in advance.  Any other opinions on that?

Regards
 Robin

This patch enables register-pressure-aware scheduling (-fsched-pressure)
by default and sets
the algorithm to "model".  As with other backends, this helps
reduce unnecessary spills.

gcc/ChangeLog:

* common/config/riscv/riscv-common.cc: Add -fsched-pressure.
* config/riscv/riscv.cc (riscv_option_override): Set sched
pressure algorithm.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/base/narrow_constraint-1.c: Add
-fno-sched-pressure.
* gcc.target/riscv/rvv/base/narrow_constraint-17.c: Ditto.
* gcc.target/riscv/rvv/base/narrow_constraint-18.c: Ditto.
* gcc.target/riscv/rvv/base/narrow_constraint-19.c: Ditto.
* gcc.target/riscv/rvv/base/narrow_constraint-20.c: Ditto.
* gcc.target/riscv/rvv/base/narrow_constraint-21.c: Ditto.
* gcc.target/riscv/rvv/base/narrow_constraint-22.c: Ditto.
* gcc.target/riscv/rvv/base/narrow_constraint-23.c: Ditto.
* gcc.target/riscv/rvv/base/narrow_constraint-24.c: Ditto.
* gcc.target/riscv/rvv/base/narrow_constraint-25.c: Ditto.
* gcc.target/riscv/rvv/base/narrow_constraint-26.c: Ditto.
* gcc.target/riscv/rvv/base/narrow_constraint-27.c: Ditto.
* gcc.target/riscv/rvv/base/narrow_constraint-28.c: Ditto.
* gcc.target/riscv/rvv/base/narrow_constraint-29.c: Ditto.
* gcc.target/riscv/rvv/base/narrow_constraint-30.c: Ditto.
* gcc.target/riscv/rvv/base/narrow_constraint-31.c: Ditto.
* gcc.target/riscv/rvv/base/narrow_constraint-4.c: Ditto.
* gcc.target/riscv/rvv/base/narrow_constraint-5.c: Ditto.
* gcc.target/riscv/rvv/base/narrow_constraint-8.c: Ditto.
* gcc.target/riscv/rvv/base/narrow_constraint-9.c: Ditto.
* gcc.target/riscv/rvv/vsetvl/vlmax_bb_prop-10.c: Ditto.
* gcc.target/riscv/rvv/vsetvl/vlmax_bb_prop-11.c: Ditto.
* gcc.target/riscv/rvv/vsetvl/vlmax_bb_prop-12.c: Ditto.
* gcc.target/riscv/rvv/vsetvl/vlmax_bb_prop-3.c: Ditto.
* gcc.target/riscv/rvv/vsetvl/vlmax_bb_prop-9.c: Ditto.
---
 gcc/common/config/riscv/riscv-common.cc  | 2 ++
 gcc/config/riscv/riscv.cc| 5 +
 .../gcc.target/riscv/rvv/base/narrow_constraint-1.c  | 2 +-
 .../gcc.target/riscv/rvv/base/narrow_constraint-17.c | 2 +-
 .../gcc.target/riscv/rvv/base/narrow_constraint-18.c | 2 +-
 .../gcc.target/riscv/rvv/base/narrow_constraint-19.c | 2 +-
 .../gcc.target/riscv/rvv/base/narrow_constraint-20.c | 2 +-
 .../gcc.target/riscv/rvv/base/narrow_constraint-21.c | 2 +-
 .../gcc.target/riscv/rvv/base/narrow_constraint-22.c | 2 +-
 .../gcc.target/riscv/rvv/base/narrow_constraint-23.c | 2 +-
 .../gcc.target/riscv/rvv/base/narrow_constraint-24.c | 2 +-
 .../gcc.target/riscv/rvv/base/narrow_constraint-25.c | 2 +-
 .../gcc.target/riscv/rvv/base/narrow_constraint-26.c | 2 +-
 .../gcc.target/riscv/rvv/base/narrow_constraint-27.c | 2 +-
 .../gcc.target/riscv/rvv/base/narrow_constraint-28.c | 2 +-
 .../gcc.target/riscv/rvv/base/narrow_constraint-29.c | 2 +-
 .../gcc.target/riscv/rvv/base/narrow_constraint-30.c | 2 +-
 .../gcc.target/riscv/rvv/base/narrow_constraint-31.c | 2 +-
 .../gcc.target/riscv/rvv/base/narrow_constraint-4.c  | 2 +-
 .../gcc.target/riscv/rvv/base/narrow_constraint-5.c  | 2 +-
 .../gcc.target/riscv/rvv/base/narrow_constraint-8.c  | 2 +-
 .../gcc.target/riscv/rvv/base/narrow_constraint-9.c  | 2 +-
 gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vlmax_bb_prop-10.c | 2 +-
 gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vlmax_bb_prop-11.c | 2 +-
 gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vlmax_bb_prop-12.c | 2 +-
 gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vlmax_bb_prop-3.c  | 2 +-
 gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vlmax_bb_prop-9.c  | 2 +-
 27 files changed, 32 insertions(+), 25 deletions(-)

diff --git a/gcc/common/config/riscv/riscv-common.cc 
b/gcc/common/config/riscv/riscv-common.cc
index 4737dcd44a1..59848b21162 100644
--- a/gcc/common/config/riscv/riscv-common.cc
+++ 

Re: [PATCH] RISC-V: Fix -march error of zhinxmin testcases

2023-08-18 Thread Robin Dapp via Gcc-patches
> This little patch fixes the -march error of a zhinxmin testcase I added 
> earlier
> and an old zhinxmin testcase, since these testcases are for zhinxmin extension
> and not zfhmin extension.

Arg, I should have noticed that ;)
OK, of course.

Regards
 Robin


Re: [PATCH V2] RISC-V: Add the missed half floating-point mode patterns of local_pic_load/store when only use zfhmin or zhinxmin

2023-08-17 Thread Robin Dapp via Gcc-patches
Indeed all ANYLSF patterns have TARGET_HARD_FLOAT (==f extension) which
is incompatible with ZHINX or ZHINXMIN anyway.  That should really be fixed
separately or at least clarified, maybe I'm missing something.

Still we can go forward with the patch itself as it improves things
independently, so LGTM.

Regards
 Robin


Re: [PATCH V2] RISC-V: Forbidden fuse vlmax vsetvl to DEMAND_NONZERO_AVL vsetvl

2023-08-17 Thread Robin Dapp via Gcc-patches
OK, thanks.

Regards
 Robin


Re: [PATCH] RISC-V: Forbidden fuse vlmax vsetvl to DEMAND_NONZERO_AVL vsetvl

2023-08-17 Thread Robin Dapp via Gcc-patches
Hi Lehua,

> XPASS: gcc.target/riscv/rvv/autovec/partial/slp-1.c scan-assembler \\tvand
> XPASS: gcc.target/riscv/rvv/autovec/partial/slp-1.c scan-assembler \\tvand
> XPASS: gcc.target/riscv/rvv/autovec/partial/slp-1.c scan-assembler \\tvand
> XPASS: gcc.target/riscv/rvv/autovec/partial/slp-1.c scan-assembler \\tvand

Thanks for checking, I know about those but have other FAILs.  Probably
due to a recent update or so, need to check.

> This is because running a testcase with spike+pk will result in an
> ILLEGAL INSTRUCTION error if the vtype registers are not initialized
> before executing vmv1r.v instruction. This case fails because of this reason,
> so explicitly execute vsetvl early. We are currently discussing with Kito to
> constrain this case in psABI and ask the execution environment(pk) to ensure
> that vtype is initialized, but not so fast. So when encountering a testcase 
> that
> fails because of this reason, I think use this way to fix it is ok.

Hmm, ok so that has nothing to do with the rest of the patch but just
happend to be the same test case.
So we didn't schedule a vsetvl here because vmv1r doesn't require
one but the simulation doesn't initialize vtype before the first vsetvl?
If this is the only instance, I guess that's OK, but please add a comment
as well.

OK with the two comments added.

Regards
 Robin


Re: [PATCH] RISC-V: Add the missed half floating-point mode patterns of local_pic_load/store when only use zfhmin

2023-08-17 Thread Robin Dapp via Gcc-patches
Hi Lehua,

thanks for fixing this.  Looks like the same reason we have the
separation of zvfh and zvfhmin for vector loads/stores.

> +;; Iterator for hardware-supported load/store floating-point modes.
> +(define_mode_iterator ANYLSF [(SF "TARGET_HARD_FLOAT || TARGET_ZFINX")
> +   (DF "TARGET_DOUBLE_FLOAT || TARGET_ZDINX")
> +   (HF "TARGET_ZFHMIN || TARGET_ZHINX")])
> +

I first thought we needed TARGET_ZFH here as well but it appears that
TARGET_ZFH implies TARGET_ZFHMIN via riscv_implied_info.  We're lacking
that on the vector side and this should be addressed separately.

You likely want TARGET_ZHINXMIN instead of ZHINX though?  I mean the
hardware support is obviously always there but the patterns should
be available for the min extension already.  Please double check as
I haven't worked with that extension before.
Our test coverage for the *inx extensions is honestly a bit sparse,
maybe you would also want to add a testcase for a similar scenario?

> -;; We can support ANYF loads into X register if there is no double support
> +;; We can support ANYLSF loads into X register if there is no double support
>  ;; or if the target is 64-bit
> -(define_insn "*local_pic_load"
> -  [(set (match_operand:ANYF 0 "register_operand" "=f,*r")
> - (mem:ANYF (match_operand 1 "absolute_symbolic_operand" "")))
> +(define_insn "*local_pic_load"
> +  [(set (match_operand:ANYLSF 0 "register_operand" "=f,*r")
> + (mem:ANYLSF (match_operand 1 "absolute_symbolic_operand" "")))
> (clobber (match_scratch:P 2 "=r,X"))]
>"TARGET_HARD_FLOAT && USE_LOAD_ADDRESS_MACRO (operands[1])
> && (!TARGET_DOUBLE_FLOAT || TARGET_64BIT)"
>"@
> -   \t%0,%1,%2
> +   \t%0,%1,%2
> \t%0,%1"
>[(set (attr "length") (const_int 8))])

Unrelated to your patch - but from a quick glimpse here I didn't see
why we require TARGET_HARD_FLOAT for the softload alternatives.  Aren't
zdinx, zfinx, zhinx a bit of a SOFT_FLOAT thing?  Well probably just
semantics... 

Apart from that LGTM.

Regards
 Robin



Re: [PATCH] RISC-V: Forbidden fuse vlmax vsetvl to DEMAND_NONZERO_AVL vsetvl

2023-08-17 Thread Robin Dapp via Gcc-patches
Hi Lehua,

unrelated but I'm seeing a lot of failing gather/scatter tests on
master right now.

> /* DIRTY -> DIRTY or VALID -> DIRTY.  */
> +   if (block_info.reaching_out.demand_p (DEMAND_NONZERO_AVL)
> +   && vlmax_avl_p (prop.get_avl ()))
> + continue;
> vector_insn_info new_info; 

Please add a small comment here which exact situation we're trying
to prevent.

> +asm volatile ("vsetivli x0, 0, e8, m1, ta, ma");

Why is this necessary or rather why is vtype uninitialized?  Is
this the mentioned bug?  If so, why do we still need it with the
vsetvl fix? 

Regards
 Robin



Re: [PATCH] RISC-V: Fix reduc_strict_run-1 test case.

2023-08-17 Thread Robin Dapp via Gcc-patches
> I'm not opposed to merging the test change, but I couldn't figure out
> where in C the implicit conversion was coming from: as far as I can
> tell the macros don't introduce any (it's "return _float16 *
> _float16"), I'd had the patch open since last night but couldn't
> figure it out.
> 
> We get a bunch of half->single->half converting in the generated
> assembly that smelled like we had a bug somewhere else, sorry if I'm
> just missing something...

Yes, good point, my explanation was wrong again.

What really (TM) happens is that the equality comparison, in presence
of _Float16 emulation(!), performs an extension to float/double for its
arguments.

So
  if (res != r * q)
is effectively
  if ((float)res != (float)(r * q))

Now, (r * q) is also implicitly computed in float.  Because the
comparison requires a float argument, there is no intermediate conversion
back to _Float16 and the value is more accurate than it would be in
_Float16.
res, however, despite being calculated in float as well, is converted
to _Float16 for the function return or rather the assignment to "res".
Therefore it is less accurate than (r * q) and the comparison fails.

So, what would also help, even though it's not obvious at first
sight is:

 TYPE res = reduc_plus_##TYPE (a, b);   \
-if (res != r * q)  \
+TYPE ref = r * q;  \
+if (res != ref)\
   __builtin_abort ();  \
   }

This does not happen with proper _zfh because the operations are done
in _Float16 precision then.  BTW such kinds of non-obvious problems
are the reason why I split off _zvfh run tests into separate files
right away.

Regards
 Robin



Re: [PATCH] RISC-V: Fix reduc_strict_run-1 test case.

2023-08-16 Thread Robin Dapp via Gcc-patches
> But if it's a float16 precision issue then I would have expected both
> the computations for the lhs and rhs values to have suffered
> similarly.

Yeah, right.  I didn't look closely enough.  The problem is not the
reduction but the additional return-value conversion that is omitted
when calculating the reference value inline.

The attached is simpler and does the trick.

Regards
 Robin

Subject: [PATCH v2] RISC-V: Fix reduc_strict_run-1 test case.

This patch fixes the reduc_strict_run-1 testcase by converting
the reference value to double and back to the tested type.
Without that, the comparison omitted the implicit return-value
conversion and would produce a different result for _Float16.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/reduc/reduc_strict_run-1.c:
Perform type -> double -> type conversion for reference value.
---
 .../gcc.target/riscv/rvv/autovec/reduc/reduc_strict_run-1.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/gcc/testsuite/gcc.target/riscv/rvv/autovec/reduc/reduc_strict_run-1.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/reduc/reduc_strict_run-1.c
index 516be97e9eb..d5a544b1cc9 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/reduc/reduc_strict_run-1.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/reduc/reduc_strict_run-1.c
@@ -17,7 +17,7 @@
asm volatile ("" ::: "memory"); \
   }\
 TYPE res = reduc_plus_##TYPE (a, b);   \
-if (res != r * q)  \
+if (res != (TYPE)(double)(r * q))  \
   __builtin_abort ();  \
   }
 
-- 
2.41.0




Re: [PATCH] IFN: Fix vector extraction into promoted subreg.

2023-08-16 Thread Robin Dapp via Gcc-patches
> However:
> 
> | #define vec_extract_direct { 3, 3, false }
> 
> This looks wrong.  The numbers are argument numbers (or -1 for a return
> value).  vec_extract only takes 2 arguments, so 3 looks to be out-of-range.
> 
> | #define direct_vec_extract_optab_supported_p direct_optab_supported_p
> 
> I would expect this to be convert_optab_supported_p.
> 
> On the promoted subreg thing, I think expand_vec_extract_optab_fn
> should use expand_fn_using_insn.

Thanks, really easier that way.  Attached a new version that's currently
bootstrapping.  Does that look better?

Regards
 Robin

Subject: [PATCH v2] internal-fn: Fix vector extraction into promoted subreg.

This patch fixes the case where vec_extract gets passed a promoted
subreg (e.g. from a return value).  This is achieved by using
expand_convert_optab_fn instead of a separate expander function.

gcc/ChangeLog:

* internal-fn.cc (vec_extract_direct): Change type argument
numbers.
(expand_vec_extract_optab_fn): Call convert_optab_fn.
(direct_vec_extract_optab_supported_p): Use
convert_optab_supported_p.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-1u.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-2u.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-3u.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-4u.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-runu.c: New test.
---
 gcc/internal-fn.cc|  44 +-
 .../rvv/autovec/vls-vlmax/vec_extract-1u.c|  63 
 .../rvv/autovec/vls-vlmax/vec_extract-2u.c|  69 +
 .../rvv/autovec/vls-vlmax/vec_extract-3u.c|  69 +
 .../rvv/autovec/vls-vlmax/vec_extract-4u.c|  70 +
 .../rvv/autovec/vls-vlmax/vec_extract-runu.c  | 137 ++
 6 files changed, 413 insertions(+), 39 deletions(-)
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-1u.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-2u.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-3u.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-4u.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-runu.c

diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
index 4f2b20a79e5..5cce36a789b 100644
--- a/gcc/internal-fn.cc
+++ b/gcc/internal-fn.cc
@@ -175,7 +175,7 @@ init_internal_fns ()
 #define len_store_direct { 3, 3, false }
 #define mask_len_store_direct { 4, 5, false }
 #define vec_set_direct { 3, 3, false }
-#define vec_extract_direct { 3, 3, false }
+#define vec_extract_direct { 0, -1, false }
 #define unary_direct { 0, 0, true }
 #define unary_convert_direct { -1, 0, true }
 #define binary_direct { 0, 0, true }
@@ -3127,43 +3127,6 @@ expand_vec_set_optab_fn (internal_fn, gcall *stmt, 
convert_optab optab)
   gcc_unreachable ();
 }
 
-/* Expand VEC_EXTRACT optab internal function.  */
-
-static void
-expand_vec_extract_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
-{
-  tree lhs = gimple_call_lhs (stmt);
-  tree op0 = gimple_call_arg (stmt, 0);
-  tree op1 = gimple_call_arg (stmt, 1);
-
-  rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
-
-  machine_mode outermode = TYPE_MODE (TREE_TYPE (op0));
-  machine_mode extract_mode = TYPE_MODE (TREE_TYPE (lhs));
-
-  rtx src = expand_normal (op0);
-  rtx pos = expand_normal (op1);
-
-  class expand_operand ops[3];
-  enum insn_code icode = convert_optab_handler (optab, outermode,
-   extract_mode);
-
-  if (icode != CODE_FOR_nothing)
-{
-  create_output_operand (&ops[0], target, extract_mode);
-  create_input_operand (&ops[1], src, outermode);
-  create_convert_operand_from (&ops[2], pos,
-  TYPE_MODE (TREE_TYPE (op1)), true);
-  if (maybe_expand_insn (icode, 3, ops))
-   {
- if (!rtx_equal_p (target, ops[0].value))
-   emit_move_insn (target, ops[0].value);
- return;
-   }
-}
-  gcc_unreachable ();
-}
-
 static void
 expand_ABNORMAL_DISPATCHER (internal_fn, gcall *)
 {
@@ -3917,6 +3880,9 @@ expand_convert_optab_fn (internal_fn fn, gcall *stmt, 
convert_optab optab,
 #define expand_unary_convert_optab_fn(FN, STMT, OPTAB) \
   expand_convert_optab_fn (FN, STMT, OPTAB, 1)
 
+#define expand_vec_extract_optab_fn(FN, STMT, OPTAB) \
+  expand_convert_optab_fn (FN, STMT, OPTAB, 2)
+
 /* RETURN_TYPE and ARGS are a return type and argument list that are
in principle compatible with FN (which satisfies direct_internal_fn_p).
Return the types that should be used to determine whether the
@@ -4019,7 +3985,7 @@ multi_vector_optab_supported_p (convert_optab optab, 
tree_pair types,
 #define direct_mask_len_fold_left_optab_supported_p 

[PATCH] RISC-V: Fix reduc_strict_run-1 test case.

2023-08-15 Thread Robin Dapp via Gcc-patches
Hi,

this patch changes the equality check for the reduc_strict_run-1
testcase from == to fabs () < EPS.  The FAIL only occurs with
_Float16 but I'd argue approximate equality is preferable for all
float modes.

Regards
 Robin

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/reduc/reduc_strict_run-1.c:
Check float equality with fabs < EPS.
---
 .../riscv/rvv/autovec/reduc/reduc_strict_run-1.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git 
a/gcc/testsuite/gcc.target/riscv/rvv/autovec/reduc/reduc_strict_run-1.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/reduc/reduc_strict_run-1.c
index 516be97e9eb..93efe2c4333 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/reduc/reduc_strict_run-1.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/reduc/reduc_strict_run-1.c
@@ -2,6 +2,9 @@
 /* { dg-additional-options "--param=riscv-autovec-preference=scalable 
-fno-vect-cost-model" } */
 
 #include "reduc_strict-1.c"
+#include <math.h>
+
+#define EPS 1e-2
 
 #define TEST_REDUC_PLUS(TYPE)  \
   {\
@@ -10,14 +13,14 @@
 TYPE r = 0, q = 3; \
 for (int i = 0; i < NUM_ELEMS (TYPE); i++) \
   {\
-   a[i] = (i * 0.1) * (i & 1 ? 1 : -1);\
-   b[i] = (i * 0.3) * (i & 1 ? 1 : -1);\
+   a[i] = (i * 0.01) * (i & 1 ? 1 : -1);   \
+   b[i] = (i * 0.03) * (i & 1 ? 1 : -1);   \
r += a[i];  \
q -= b[i];  \
asm volatile ("" ::: "memory"); \
   }\
 TYPE res = reduc_plus_##TYPE (a, b);   \
-if (res != r * q)  \
+if (fabs (res - r * q) > EPS)  \
   __builtin_abort ();  \
   }
 
-- 
2.41.0


[PATCH] IFN: Fix vector extraction into promoted subreg.

2023-08-15 Thread Robin Dapp via Gcc-patches
Hi,

this patch fixes the case where vec_extract gets passed a promoted
subreg (e.g. from a return value).  When such a subreg is the
destination of a vector extraction we create a separate pseudo
register and ensure that the necessary promotion is performed
afterwards.

Before this patch, a sign-extended subreg would erroneously not
be zero-extended, e.g. when used as a return value.  I added missing
test cases for unsigned vec_extract on RISC-V that check the
proper behavior.

Testsuite and bootstrap done on x86, aarch64 and power10.

Regards
 Robin

gcc/ChangeLog:

* internal-fn.cc (expand_vec_extract_optab_fn): Handle
SUBREG_PROMOTED_VAR_P.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-1u.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-2u.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-3u.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-4u.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-runu.c: New test.
---
 gcc/internal-fn.cc|  25 +++-
 .../rvv/autovec/vls-vlmax/vec_extract-1u.c|  63 
 .../rvv/autovec/vls-vlmax/vec_extract-2u.c|  69 +
 .../rvv/autovec/vls-vlmax/vec_extract-3u.c|  69 +
 .../rvv/autovec/vls-vlmax/vec_extract-4u.c|  70 +
 .../rvv/autovec/vls-vlmax/vec_extract-runu.c  | 137 ++
 6 files changed, 430 insertions(+), 3 deletions(-)
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-1u.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-2u.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-3u.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-4u.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-runu.c

diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
index 4f2b20a79e5..b1b12cc8369 100644
--- a/gcc/internal-fn.cc
+++ b/gcc/internal-fn.cc
@@ -3150,14 +3150,33 @@ expand_vec_extract_optab_fn (internal_fn, gcall *stmt, 
convert_optab optab)
 
   if (icode != CODE_FOR_nothing)
 {
-  create_output_operand (&ops[0], target, extract_mode);
+  /* Some backends like riscv sign-extend the extraction result to a full
+Pmode register.  If we are passed a promoted subreg as target make
+sure not to use it as target directly.  Instead, use a new pseudo
+and perform the necessary extension afterwards. */
+  rtx dest = target;
+  if (target && SUBREG_P (target) && SUBREG_PROMOTED_VAR_P (target))
+   dest = gen_reg_rtx (extract_mode);
+
+  create_output_operand (&ops[0], dest, extract_mode);
+
   create_input_operand (&ops[1], src, outermode);
   create_convert_operand_from (&ops[2], pos,
   TYPE_MODE (TREE_TYPE (op1)), true);
   if (maybe_expand_insn (icode, 3, ops))
{
- if (!rtx_equal_p (target, ops[0].value))
-   emit_move_insn (target, ops[0].value);
+ if (!rtx_equal_p (dest, target))
+   {
+ if (SUBREG_P (target) && SUBREG_PROMOTED_VAR_P (target))
+   {
+ /* Have convert_move perform the subreg promotion.  */
+ rtx tmp = convert_to_mode (extract_mode, ops[0].value, 0);
+ convert_move (SUBREG_REG (target), tmp,
+   SUBREG_PROMOTED_SIGN (target));
+   }
+ else
+   emit_move_insn (target, dest);
+   }
  return;
}
 }
diff --git 
a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-1u.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-1u.c
new file mode 100644
index 000..a35988ff55d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-1u.c
@@ -0,0 +1,63 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-march=rv64gcv_zvfh -mabi=lp64d -Wno-pedantic 
-Wno-psabi" } */
+
+#include <stdint.h>
+
+typedef uint64_t vnx2di __attribute__((vector_size (16)));
+typedef uint32_t vnx4si __attribute__((vector_size (16)));
+typedef uint16_t vnx8hi __attribute__((vector_size (16)));
+typedef uint8_t vnx16qi __attribute__((vector_size (16)));
+
+#define VEC_EXTRACT(S,V,IDX)   \
+  S\
+  __attribute__((noipa))   \
+  vec_extract_##V##_##IDX (V v)\
+  {\
+return v[IDX]; \
+  }
+
+#define VEC_EXTRACT_VAR1(S,V)  \
+  S\
+  __attribute__((noipa))   \
+  vec_extract_var_##V (V v, int8_t idx)\
+  {\
+return v[idx]; 

Re: [PATCH] RISC-V: Implement vector "average" autovec pattern.

2023-08-15 Thread Robin Dapp via Gcc-patches
> Plz put your testcases into:
> 
> # widening operation only test on LMUL < 8
> set AUTOVEC_TEST_OPTS [list \
>   {-ftree-vectorize -O3 --param riscv-autovec-lmul=m1} \
>   {-ftree-vectorize -O3 --param riscv-autovec-lmul=m2} \
>   {-ftree-vectorize -O3 --param riscv-autovec-lmul=m4} \
>   {-ftree-vectorize -O2 --param riscv-autovec-lmul=m1} \
>   {-ftree-vectorize -O2 --param riscv-autovec-lmul=m2} \
>   {-ftree-vectorize -O2 --param riscv-autovec-lmul=m4} ]
> foreach op $AUTOVEC_TEST_OPTS {
>   dg-runtest [lsort [glob -nocomplain 
> $srcdir/$subdir/autovec/widen/*.\[cS\]]] \
>     "" "$op"
> }
> 
> You could either simpilfy put them into "widen" directory or create a new 
> directly.
> Anyway, make sure you have fully tested it with LMUL = 1/2/4.

Ah, almost forgot this.  I moved the tests to the widen directory
and will push it after testing.

Regards
 Robin


Re: [PATCH] RISC-V: Fix autovec_length_operand predicate[PR110989]

2023-08-15 Thread Robin Dapp via Gcc-patches
> Currently, autovec_length_operand predicate incorrect configuration is
> discovered in PR110989 since this following situation:

In case you haven't committed it yet: This is OK.

Regards
 Robin


Re: [PATCH V4] VECT: Support loop len control on EXTRACT_LAST vectorization

2023-08-14 Thread Robin Dapp via Gcc-patches
Hi Kewen,

> I did a bootstrapping and regression testing on Power10 (LE) and found a lot 
> of failures.

I think the problem is that just like for vec_set we're expecting
the vec_extract expander not to fail.  It is probably no longer passed
a const int here and therefore fails to expand?

can_vec_extract_var_idx_p is supposed to check if the backend
supports extracting a variable index.

Regards
 Robin


Re: [PATCH] RISC-V: Add MASK vec_duplicate pattern[PR110962]

2023-08-10 Thread Robin Dapp via Gcc-patches
> Is this patch ok ? Maybe we can find a way to add a target specific
> fortran test but should not block this bug fix.

It's not much different than adding a C testcase actually, apart from 
starting comments with a !

But well, LGTM.  The test doesn't look that complicated and quite likely
is covered by the Fortran testsuite already.

Regards
 Robin


Re: [PATCH V2] VECT: Support loop len control on EXTRACT_LAST vectorization

2023-08-10 Thread Robin Dapp via Gcc-patches
> Hmm, I think VEC_EXTRACT and VEC_SET should be ECF_CONST.  Maybe the 
> GIMPLE ISEL
> comments do not match the implementation, but then that should be fixed?
> 
> /* Expand all ARRAY_REF(VIEW_CONVERT_EXPR) gimple assignments into calls 
> to
>internal function based on vector type of selected expansion.
> 
>For vec_set:
> 
>  VIEW_CONVERT_EXPR(u)[_1] = i_4(D);
>=>
>  _7 = u;
>  _8 = .VEC_SET (_7, i_4(D), _1);
>  u = _8;
>   
>For vec_extract:
> 
>   _3 = VIEW_CONVERT_EXPR(vD.2208)[idx_2(D)];
>=>
>   _4 = vD.2208;
>   _3 = .VEC_EXTRACT (_4, idx_2(D));  */
> 

I probably just forgot to set ECF_CONST in the recent isel patch
for vec_extract.

Regards
 Robin


Re: [PATCH] RISC-V: Support TU for integer ternary OP[PR110964]

2023-08-10 Thread Robin Dapp via Gcc-patches
OK.

Regards
 Robin



Re: [PATCH] RISC-V: Add MASK vec_duplicate pattern[PR110962]

2023-08-10 Thread Robin Dapp via Gcc-patches
Is the testcase already in the test suite?  If not we should add it.
Apart from that LGTM. 

Regards
 Robin


Re: [PATCH] RISC-V: Add missing modes to the iterators

2023-08-10 Thread Robin Dapp via Gcc-patches
Yeah, thanks, better in this separate patch.

OK.

Regards
 Robin



Re: [PATCH] RISC-V: Support NPATTERNS = 1 stepped vector[PR110950]

2023-08-09 Thread Robin Dapp via Gcc-patches
OK, thanks.

Regards
 Robin


Re: [PATCH] vect: Add a popcount fallback.

2023-08-09 Thread Robin Dapp via Gcc-patches
> We seem to be looking at promotions of the call argument, lhs_type
> is the same as the type of the call LHS.  But the comment mentions .POPCOUNT
> and the following code also handles others, so maybe handling should be
> moved.  Also when we look to vectorize popcount (x) instead of popcount((T)x)
> we can simply promote the result accordingly.

IMHO lhs_type is the type of the conversion

  lhs_oprnd = gimple_assign_lhs (last_stmt);
  lhs_type = TREE_TYPE (lhs_oprnd);

and rhs/unprom_diff has the type of the call's input argument

  rhs_oprnd = gimple_call_arg (call_stmt, 0);
  vect_look_through_possible_promotion (vinfo, rhs_oprnd, &unprom_diff);

So we can potentially have
  T0 arg
  T1 in = (T1)arg
  T2 ret = __builtin_popcount (in)
  T3 lhs = (T3)ret

and we're checking if precision (T0) == precision (T3).

This will never be true for a proper __builtin_popcountll except if
the return value is cast to uint64_t (which I just happened to do
in my test...).  Therefore it still doesn't really make sense to me.

Interestingly though, it helps for an aarch64 __builtin_popcountll
testcase where we abort here and then manage to vectorize via
vectorizable_call.  When we skip this check, recognition succeeds
and replaces the call with the pattern.  Then scalar costs are lower
than in the vectorizable_call case because __builtin_popcountll is
not STMT_VINFO_RELEVANT_P anymore (not live or so?).
Then, vectorization costs are too high compared to the wrong scalar
costs and we don't vectorize... Odd, might require fixing separately.
We might need to calculate the scalar costs in advance?

> It looks like vect_recog_popcount_clz_ctz_ffs_pattern is specifcally for
> the conversions, so your fallback should possibly apply even when not
> matching them.

Mhm, yes, it appears to only match when casting the return value to
something other than an int.  So we'd need a fallback in vectorizable_call?
And it would potentially look a bit out of place there only handling
popcount and not ctz, clz, ...  Not sure if it is worth it then?

Regards
 Robin



Re: [PATCH] vect: Add a popcount fallback.

2023-08-08 Thread Robin Dapp via Gcc-patches


> Presumably this is an alternative to the approach Juzhe posted a week
> or two ago and ultimately dropped?

Yeah, I figured having a generic fallback could help more targets.
We can still have a better expander if we see the need.

Regards
 Robin


Re: [PATCH] vect: Add a popcount fallback.

2023-08-08 Thread Robin Dapp via Gcc-patches
> Hmm, the conversion should be a separate statement so I wonder
> why it would go wrong?

It is indeed.  Yet, lhs_type is the lhs type of the conversion
and not the call and consequently we compare the precision of
the converted type with the popcount input.

So we should probably rather do something like:

+  tree call_lhs = gimple_call_lhs (call_stmt);
+
   /* Input and output of .POPCOUNT should be same-precision integer.  */
-  if (TYPE_PRECISION (unprom_diff.type) != TYPE_PRECISION (lhs_type))
+  if (TYPE_PRECISION (unprom_diff.type) != TYPE_PRECISION (TREE_TYPE 
(call_lhs)))
 return NULL;

Regards
 Robin


Re: [PATCH] RISC-V: Allow CONST_VECTOR for VLS modes.

2023-08-08 Thread Robin Dapp via Gcc-patches
Hi Juzhe,

just some nits.

> -  else if (rtx_equal_p (step, constm1_rtx) && poly_int_rtx_p (base, &value)
> +  else if (rtx_equal_p (step, constm1_rtx)
> +&& poly_int_rtx_p (base, &value)

Looks like just a line-break change and the line is not too long?

> -  rtx ops[] = {dest, vid, gen_int_mode (nunits_m1, GET_MODE_INNER 
> (mode))};
> -  insn_code icode = code_for_pred_sub_reverse_scalar (mode);
> -  emit_vlmax_insn (icode, RVV_BINOP, ops);
> +  if (value.is_constant () && IN_RANGE (value.to_constant (), -16, 15))

At some point, we'd want to unify all the [-16, 15] handling.  We already have
simm5_p but that takes an rtx.  Not urgent for now just to keep in mind.

> + {
> +   rtx dup = gen_const_vector_dup (mode, value);
> +   rtx ops[] = {dest, dup, vid};
> +   insn_code icode = code_for_pred (MINUS, mode);
> +   emit_vlmax_insn (icode, RVV_BINOP, ops);
> + }
> +  else
> + {
> +   rtx ops[]
> + = {dest, vid, gen_int_mode (nunits_m1, GET_MODE_INNER (mode))};
> +   insn_code icode = code_for_pred_sub_reverse_scalar (mode);
> +   emit_vlmax_insn (icode, RVV_BINOP, ops);
> + }
>return;
>  }
>else
> @@ -1416,7 +1428,9 @@ expand_const_vector (rtx target, rtx src)
>rtx base, step;
>if (const_vec_series_p (src, &base, &step))
>  {
> -  emit_insn (gen_vec_series (mode, target, base, step));
> +  rtx tmp = gen_reg_rtx (mode);
> +  emit_insn (gen_vec_series (mode, tmp, base, step));
> +  emit_move_insn (target, tmp);

This seems a bit inconsistent from a caller's perspective
as we also do emit_insn (gen_vec_series, ...) without extra move
at another spot.  Can we handle this directly in expand_vec_series?

> +  (V1HF "TARGET_VECTOR_VLS && TARGET_VECTOR_ELEN_FP_16")
>(V2HF "TARGET_VECTOR_VLS && TARGET_VECTOR_ELEN_FP_16")
>(V4HF "TARGET_VECTOR_VLS && TARGET_VECTOR_ELEN_FP_16")
>(V8HF "TARGET_VECTOR_VLS && TARGET_VECTOR_ELEN_FP_16")
> @@ -479,6 +480,7 @@
>(V512HF "TARGET_VECTOR_VLS && TARGET_VECTOR_ELEN_FP_16 && TARGET_MIN_VLEN 
> >= 1024")
>(V1024HF "TARGET_VECTOR_VLS && TARGET_VECTOR_ELEN_FP_16 && TARGET_MIN_VLEN 
> >= 2048")
>(V2048HF "TARGET_VECTOR_VLS && TARGET_VECTOR_ELEN_FP_16 && TARGET_MIN_VLEN 
> >= 4096")
> +  (V1SF "TARGET_VECTOR_VLS && TARGET_VECTOR_ELEN_FP_32")
>(V2SF "TARGET_VECTOR_VLS && TARGET_VECTOR_ELEN_FP_32")
>(V4SF "TARGET_VECTOR_VLS && TARGET_VECTOR_ELEN_FP_32")
>(V8SF "TARGET_VECTOR_VLS && TARGET_VECTOR_ELEN_FP_32")
> @@ -489,6 +491,7 @@
>(V256SF "TARGET_VECTOR_VLS && TARGET_VECTOR_ELEN_FP_32 && TARGET_MIN_VLEN 
> >= 1024")
>(V512SF "TARGET_VECTOR_VLS && TARGET_VECTOR_ELEN_FP_32 && TARGET_MIN_VLEN 
> >= 2048")
>(V1024SF "TARGET_VECTOR_VLS && TARGET_VECTOR_ELEN_FP_32 && TARGET_MIN_VLEN 
> >= 4096")
> +  (V1DF "TARGET_VECTOR_VLS && TARGET_VECTOR_ELEN_FP_64")
>(V2DF "TARGET_VECTOR_VLS && TARGET_VECTOR_ELEN_FP_64")
>(V4DF "TARGET_VECTOR_VLS && TARGET_VECTOR_ELEN_FP_64")
>(V8DF "TARGET_VECTOR_VLS && TARGET_VECTOR_ELEN_FP_64 && TARGET_MIN_VLEN >= 
> 64")

This hunk seems unrelated to the rest.  I suppose it's just a fixup
for 1-element float vectors for VLS?

Apart from that, looks good to me.

Regards
 Robin



Re: [PATCH] vect: Add a popcount fallback.

2023-08-08 Thread Robin Dapp via Gcc-patches
> Well, not sure how VECT_COMPARE_COSTS can help here, we either
> get the pattern or vectorize the original function.  There's no special 
> handling
> for popcount in vectorizable_call so all special cases are handled via 
> patterns.
> I was thinking of popcounthi via popcountsi and zero-extend / truncate but
> also popcountdi via popcountsi and reducing even/odd SI results via a plus
> to a single DI result.  It might be that targets without DI/TI popcount 
> support
> but SI popcount support might exist and that this might be cheaper than
> the generic open-coded scheme.  But of course such target could then
> implement the DImode version with that trick itself.

Ah, then I misunderstood.  Yes, that would be a better fallback option.
A thing for my "spare time" pile :)

Btw another thing I noticed:

  /* Input and output of .POPCOUNT should be same-precision integer.  */
  if (TYPE_PRECISION (unprom_diff.type) != TYPE_PRECISION (lhs_type))
return NULL;

This prevents us from vectorizing i.e.
(uint64_t)__builtin_popcount(uint32_t).  It appears like an
unnecessary restriction as all types should be able to hold a popcount
result (as long as TYPE_PRECISION > 6) if the result is properly
converted?  Maybe it complicates the fallback handling but in general
we should be fine?

> I agree with two cases it isn't too bad, note you probably get away
> with using the full 64bit constant for both 64bit and 32bit, we simply
> truncate it.  Note rather than 'ull' we have the HOST_WIDE_INT_UC
> macro which appends the appropriate suffix.
> 
> The patch is OK with or without changing this detail.

Thanks, changed to the full constant.  Going to push after bootstrap
and testsuite runs.

Regards
 Robin


Re: [PATCH v2] Mode-Switching: Fix SET_SRC ICE when USE or CLOBBER

2023-08-08 Thread Robin Dapp via Gcc-patches
> Could you please help to share how to enable checks here?
Build with --enable-checking or rather --enable-checking=extra.

Regards
 Robin



Re: [PATCH] vect: Add a popcount fallback.

2023-08-08 Thread Robin Dapp via Gcc-patches
> Looks reasonable to me - I couldn't read from above whether you did
> testing on riscv and thus verified the runtime correctness of the fallback?
> If not may I suggest to force matching the pattern on a target you can
> test for this purpose?

I tested on riscv (manually and verified the run test) but didn't bootstrap.
The vector test suite (or autovec) is not yet enabled by default anyway but
that's going to change soon.

> ... note this doesn't actually check the target can do these operations,
> you'd have to look whether optab_handler (optab, TYPE_MODE (vec_type))
> isn't CODE_FOR_nothing.  I see we don't do this consistently though,
> and the alternative is a known unsupported popcount.

Yes, agreed.  I changed it to

static bool
vect_have_popcount_fallback (tree vec_type)
{
  return ((target_has_op_for_code (RSHIFT_EXPR, vec_type, optab_scalar)
   || target_has_op_for_code (RSHIFT_EXPR, vec_type, optab_vector))
  && target_has_op_for_code (PLUS_EXPR, vec_type, optab_default)
  && target_has_op_for_code (MINUS_EXPR, vec_type, optab_default)
  && target_has_op_for_code (BIT_AND_EXPR, vec_type, optab_default)
  && target_has_op_for_code (MULT_EXPR, vec_type, optab_default));
}

target_has_vecop_for_code was already there further down, so I
repurposed it into that one:

+/* Return true iff the target has an optab implementing the operation
+   CODE on type VECTYPE using the optab subtype OPTAB_TYPE.  */
+
+static bool
+target_has_op_for_code (tree_code code, tree vectype,
+   enum optab_subtype optab_type)
+{
+  optab optab = optab_for_tree_code (code, vectype, optab_type);
+  return optab
+&& optab_handler (optab, TYPE_MODE (vectype)) != CODE_FOR_nothing;
+}

Changes attached.

> Did you check whether we try popcount with DImode before using the
> fallback for SImode?  Or whether we try V2nSImode before falling
> back to VnDImode?  Note that if the target has popcountqi or hi then
> we can end up pattern matching popcount for those modes, not sure
> whether targets usually support vectorized those.

I haven't observed cases where we vectorize a "worse" mode now.
At least aarch64 tries all modes for vectorization and compares costs
(starting with the widest mode IIRC) so I would expect the fallback
version to always have higher costs and not be selected if there
is a real popcount available.  riscv also has VECT_COMPARE_COSTS.

Power has QImode and HImode vector popcounts, no VECT_COMPARE_COSTS
but the testsuite is unchanged FWIW.  s390 is similar but I couldn't
test it.  A problem would probably occur if a target provides
e.g. only popcountv16qi but we would emit a fallback for popcountv2di?
I'd hope there is no such target :D and if so it should use
VECT_COMPARE_COSTS?  

> Hmm, looks like we miss a useful helper to produce an
> integer constant with a repeated byte sequence?  A
> 
> unsigned char buf[8];
> memset (buf, val, 8);
> c1 = native_interpret (...);
> 
> would do the trick but I guess we can have it cheaper using wide-int
> directly?  This must have come up before ...

I didn't find something comparable and that's probably due to the
lack of a proper search term.  Also, I figured the 2-byte repeating
sequences might be trickier anyway and therefore kept it as is.
If you find it too cumbersome I can look for an alternative.
Right now it closely matches what the example C code says which
is not too bad IMHO.

Regards
 Robin

>From 03d7e953346b763bc3d0359d7d77b1f65ca05d46 Mon Sep 17 00:00:00 2001
From: Robin Dapp 
Date: Tue, 1 Aug 2023 22:05:09 +0200
Subject: [PATCH] vect: Add a popcount fallback.

This patch adds a fallback when the backend does not provide a popcount
implementation.  The algorithm is the same one libgcc uses, as well as
match.pd for recognizing a popcount idiom.

gcc/ChangeLog:

* tree-vect-patterns.cc (vect_have_popcount_fallback): New
function.
(vect_generate_popcount_fallback): New function to emit
vectorized popcount fallback.
(vect_recog_ctz_ffs_pattern): Use fallback.
(vect_recog_popcount_clz_ctz_ffs_pattern): Ditto.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/vect-popcount-fallback.c: New test.
---
 .../gcc.dg/vect/vect-popcount-fallback.c  | 106 +
 gcc/tree-vect-patterns.cc | 205 +++---
 2 files changed, 286 insertions(+), 25 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-popcount-fallback.c

diff --git a/gcc/testsuite/gcc.dg/vect/vect-popcount-fallback.c 
b/gcc/testsuite/gcc.dg/vect/vect-popcount-fallback.c
new file mode 100644
index 000..c1d23257b8f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-popcount-fallback.c
@@ -0,0 +1,106 @@
+/* Check if we vectorize popcount when no expander is available.  */
+/* { dg-do run { target { amdgcn-*-* sparc*-*-* alpha*-*-* ia64-*-* mips*-*-* 
riscv*-*-* } } } */
+/* { dg-additional-options { -O2 -fdump-tree-vect-details -fno-vect-cost-model 

[PATCH] vect: Add a popcount fallback.

2023-08-07 Thread Robin Dapp via Gcc-patches
Hi,

This patch adds a fallback when the backend does not provide a popcount
implementation.  The algorithm is the same one libgcc uses, as well as
match.pd for recognizing a popcount idiom.  __builtin_ctz and __builtin_ffs
can also rely on popcount so I used the fallback for them as well.

Bootstrapped and regtested on x86, aarch64 and power10.  Unfortunately
I don't have access to any architecture other than riscv that vectorizes
but does not have a vectorized popcount.  I added all vect_int targets
to the selector where a cursory grep "expand.*popcount" would yield no
result. 

Regards
 Robin

gcc/ChangeLog:

* tree-vect-patterns.cc (vect_have_popcount_fallback): New
function.
(vect_generate_popcount_fallback): New function to emit
vectorized popcount fallback.
(vect_recog_ctz_ffs_pattern): Use fallback.
(vect_recog_popcount_clz_ctz_ffs_pattern): Ditto.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/vect-popcount-fallback.c: New test.
---
 .../gcc.dg/vect/vect-popcount-fallback.c  | 106 +++
 gcc/tree-vect-patterns.cc | 172 --
 2 files changed, 267 insertions(+), 11 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-popcount-fallback.c

diff --git a/gcc/testsuite/gcc.dg/vect/vect-popcount-fallback.c 
b/gcc/testsuite/gcc.dg/vect/vect-popcount-fallback.c
new file mode 100644
index 000..f6300f4ab35
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-popcount-fallback.c
@@ -0,0 +1,106 @@
+/* Check if we vectorize popcount when no expander is available.  */
+/* { dg-do run { target { amdgcn-*-* sparc*-*-* alpha*-*-* ia64-*-* mips*-*-* 
riscv*-*-* } } } */
+/* { dg-additional-options { -O2 -fdump-tree-vect-details -fno-vect-cost-model 
} }  */
+/* { dg-require-effective-target vect_int } */
+
+#include <stdint.h>
+#include <stdlib.h>
+#include <assert.h>
+
+__attribute__ ((noipa))
+void popc32 (uint32_t *restrict dst, uint32_t *restrict a, int n)
+{
+  for (int i = 0; i < n; i++)
+dst[i] = __builtin_popcount (a[i]);
+}
+
+__attribute__ ((noipa))
+void ctz32 (uint32_t *restrict dst, uint32_t *restrict a, int n)
+{
+  for (int i = 0; i < n; i++)
+dst[i] = __builtin_ctz (a[i]);
+}
+
+__attribute__ ((noipa))
+void ffs32 (uint32_t *restrict dst, uint32_t *restrict a, int n)
+{
+  for (int i = 0; i < n; i++)
+dst[i] = __builtin_ffs (a[i]);
+}
+
+__attribute__ ((noipa))
+void popc64 (uint64_t *restrict dst, uint64_t *restrict a, int n)
+{
+  for (int i = 0; i < n; i++)
+dst[i] = __builtin_popcountll (a[i]);
+}
+
+__attribute__ ((noipa))
+void ctz64 (uint64_t *restrict dst, uint64_t *restrict a, int n)
+{
+  for (int i = 0; i < n; i++)
+dst[i] = __builtin_ctzll (a[i]);
+}
+
+__attribute__ ((noipa))
+void ffs64 (uint64_t *restrict dst, uint64_t *restrict a, int n)
+{
+  for (int i = 0; i < n; i++)
+dst[i] = __builtin_ffsll (a[i]);
+}
+
+#define SZ 512
+
+__attribute__ ((optimize ("0")))
+int main ()
+{
+  uint32_t *a32pc = malloc (SZ * sizeof (*a32pc));
+  uint32_t *b32pc = malloc (SZ * sizeof (*b32pc));
+  uint32_t *a32ctz = malloc (SZ * sizeof (*a32ctz));
+  uint32_t *b32ctz = malloc (SZ * sizeof (*b32ctz));
+  uint32_t *a32ffs = malloc (SZ * sizeof (*a32ffs));
+  uint32_t *b32ffs = malloc (SZ * sizeof (*b32ffs));
+
+  uint64_t *a64pc = malloc (SZ * sizeof (*a64pc));
+  uint64_t *b64pc = malloc (SZ * sizeof (*b64pc));
+  uint64_t *a64ctz = malloc (SZ * sizeof (*a64ctz));
+  uint64_t *b64ctz = malloc (SZ * sizeof (*b64ctz));
+  uint64_t *a64ffs = malloc (SZ * sizeof (*a64ffs));
+  uint64_t *b64ffs = malloc (SZ * sizeof (*b64ffs));
+
+  for (int i = 0; i < SZ; i++)
+{
+  int ia = i + 1;
+  a32pc[i] = ia * 1234567;
+  b32pc[i] = 0;
+  a32ctz[i] = ia * 1234567;
+  b32ctz[i] = 0;
+  a32ffs[i] = ia * 1234567;
+  b32ffs[i] = 0;
+  a64pc[i] = ia * 123456789ull;
+  b64pc[i] = 0;
+  a64ctz[i] = ia * 123456789ull;
+  b64ctz[i] = 0;
+  a64ffs[i] = ia * 123456789ull;
+  b64ffs[i] = 0;
+}
+
+  popc32 (b32pc, a32pc, SZ);
+  ctz32 (b32ctz, a32ctz, SZ);
+  ffs32 (b32ffs, a32ffs, SZ);
+  popc64 (b64pc, a64pc, SZ);
+  ctz64 (b64ctz, a64ctz, SZ);
+  ffs64 (b64ffs, a64ffs, SZ);
+
+  for (int i = 0; i < SZ; i++)
+{
+  assert (b32pc[i] == __builtin_popcount (a32pc[i]));
+  assert (b32ctz[i] == __builtin_ctz (a32ctz[i]));
+  assert (b32ffs[i] == __builtin_ffs (a32ffs[i]));
+  assert (b64pc[i] == __builtin_popcountll (a64pc[i]));
+  assert (b64ctz[i] == __builtin_ctzll (a64ctz[i]));
+  assert (b64ffs[i] == __builtin_ffsll (a64ffs[i]));
+}
+}
+
+/* { dg-final { scan-tree-dump-times "LOOP VECTORIZED" 6 "vect" target { 
amdgcn-*-* sparc*-*-* alpha*-*-* ia64-*-* mips*-*-* riscv*-*-* } } }  */
diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index ef806e2346e..b812354b986 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -1782,6 +1782,122 @@ vect_recog_widen_abd_pattern (vec_info *vinfo, 
stmt_vec_info 

Re: [PATCH] RISC-V: Support VLS basic operation auto-vectorization

2023-08-07 Thread Robin Dapp via Gcc-patches
Hi Juzhe,

thanks, looks good from my side.

> +/* { dg-final { scan-assembler-times {vand\.vi\s+v[0-9]+,\s*v[0-9]+,\s*-16} 
> 42 } } */
> +/* { dg-final { scan-assembler-not {csrr} } } */

I was actually looking for a scan-assembler-not vsetvli... but the
csrr will do as well.

Regards
 Robin


[PATCH] fwprop: Allow UNARY_P and check register pressure.

2023-08-07 Thread Robin Dapp via Gcc-patches
Hi,

originally inspired by the wish to transform

 vmv v3, a0 ; = vec_duplicate
 vadd.vv v1, v2, v3

into

 vadd.vx v1, v2, a0

via fwprop for riscv, this patch enables the forward propagation
of UNARY_P sources.

As this involves potentially replacing a vector register with
a scalar register the ira_hoist_pressure machinery is used to
calculate the change in register pressure.  If the propagation
would increase the pressure beyond the number of hard regs, we
don't perform it.

The regpressure commit this patch depends on is not yet pushed
because I figured I'd post the code using it in case of further
comments.

The testsuite is unchanged on i386 and power10 but aarch64 has
some new FAILs but I'm not terribly familiar with aarch64, hence
some examples here:

The following cases shrn-combine-1/2/3 seem worse as we emit one
instruction more.

Before:
  shrn    v30.8b, v30.8h, 2
  shrn2   v30.16b, v31.8h, 2
  str     q30, [x1, x3]
After:
  ushr    v30.8h, v30.8h, 2
  ushr    v31.8h, v31.8h, 2
  uzp1    v31.16b, v30.16b, v31.16b
  str     q31, [x1, x3]

Here, one optimization already happens in fwprop so combine
cannot do the same work it did before.

vec-init-22-speed.c changes from
  sxthw0, w0
  sxthw1, w1
  fmovd31, x0
  fmovd0, x1
to:
  dup v31.4h, w0
  dup v0.4h, w1 
which I hope has the same semantics and is shorter.

Apart from that there are numerous check-asm testcases where
a new, earlier propagation prevents a later, supposedly better
propagation.  One such example is from ld4_s8.c:

  (insn 11 10 12 2 (set (reg:DI 100)
  (neg:DI (reg:DI 102))) 385 {negdi2}
   (expr_list:REG_EQUAL (const_poly_int:DI [-576, -576])
  (nil)))
  (insn 12 11 13 2 (set (reg/f:DI 99)
  (plus:DI (reg/v/f:DI 97 [ x0 ])
  (reg:DI 100))) 153 {*adddi3_aarch64}
   (expr_list:REG_EQUAL (plus:DI (reg/v/f:DI 97 [ x0 ])
  (const_poly_int:DI [-576, -576]))
  (nil)))
  (insn 13 12 14 2 (set (reg/v:VNx64QI 94 [ z0 ])
  (unspec:VNx64QI [
  (reg/v:VNx16BI 96 [ p0 ])
  (mem:VNx64QI (reg/f:DI 99) [0 MEM  [(signed char *)_1]+0 S[64, 64] A8])
  ] UNSPEC_LDN)) 5885 {vec_mask_load_lanesvnx64qivnx16qi}
   (nil))

where we now do:

  propagating insn 11 into insn 12, replacing:
  (set (reg/f:DI 99)
  (plus:DI (reg/v/f:DI 97 [ x0 ])
  (reg:DI 100)))
  successfully matched this instruction to subdi3:
  (set (reg/f:DI 99)
  (minus:DI (reg/v/f:DI 97 [ x0 ])
  (reg:DI 102)))

vs before:

  cannot propagate from insn 11 into insn 12: would increase complexity of pattern

  propagating insn 12 into insn 13, replacing:
  (set (reg/v:VNx64QI 94 [ z0 ])
      (unspec:VNx64QI [
              (reg/v:VNx16BI 96 [ p0 ])
              (mem:VNx64QI (reg/f:DI 99) [0 MEM [(signed char *)_1]+0 S[64, 64] A8])
          ] UNSPEC_LDN))
  successfully matched this instruction to vec_mask_load_lanesvnx64qivnx16qi:
  (set (reg/v:VNx64QI 94 [ z0 ])
      (unspec:VNx64QI [
              (reg/v:VNx16BI 96 [ p0 ])
              (mem:VNx64QI (plus:DI (reg/v/f:DI 97 [ x0 ])
                  (reg:DI 100)) [0 MEM [(signed char *)_1]+0 S[64, 64] A8])
          ] UNSPEC_LDN))

All in all this seems like a general problem of earlier
optimizations preventing later ones, and surely each of those
cases could be fixed individually.  Still, the question remains
whether the general approach is useful or desired, or whether we
would prevent more optimizations than we gain.  Suggestions welcome.

I have some riscv tests for it but didn't attach them yet
in order to focus on the main part first.

Regards
 Robin

gcc/ChangeLog:

* fwprop.cc (fwprop_propagation::profitable_p): Add unary
handling.
(fwprop_propagation::update_register_pressure): New function.
(fwprop_propagation::register_pressure_high_p): New function.
(reg_single_def_for_src_p): Look through unary expressions.
(try_fwprop_subst_pattern): Check register pressure.
(forward_propagate_into): Call new function.
(fwprop_init): Init register pressure.
(fwprop_done): Clean up register pressure.
(fwprop_insn): Add comment.
---
 gcc/fwprop.cc | 307 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 300 insertions(+), 7 deletions(-)

diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc
index 0707a234726..413fe4e7335 100644
--- a/gcc/fwprop.cc
+++ b/gcc/fwprop.cc
@@ -36,6 +36,10 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-pass.h"
 #include "rtl-iter.h"
 #include "target.h"
+#include "dominance.h"
+
+#include "ira.h"
+#include "regpressure.h"
 
 /* This 
