[PATCH V2] RISC-V: Support Dynamic LMUL Cost model

2023-09-11 Thread Juzhe-Zhong
This patch support dynamic LMUL cost modeling with 
--param=riscv-autovec-lmul=dynamic.

Consider this following case:
void
foo (int32_t *__restrict a, int32_t *__restrict b,int32_t *__restrict c,
  int32_t *__restrict a2, int32_t *__restrict b2, int32_t *__restrict c2,
  int32_t *__restrict a3, int32_t *__restrict b3, int32_t *__restrict c3,
  int32_t *__restrict a4, int32_t *__restrict b4, int32_t *__restrict c4,
  int32_t *__restrict a5, int32_t *__restrict b5, int32_t *__restrict c5,
  int32_t *__restrict d,
  int32_t *__restrict d2,
  int32_t *__restrict d3,
  int32_t *__restrict d4,
  int32_t *__restrict d5,
  int n)
{
  for (int i = 0; i < n; i++)
{
  a[i] = b[i] + c[i];
  b5[i] = b[i] + c[i];
  a2[i] = b2[i] + c2[i];
  a3[i] = b3[i] + c3[i];
  a4[i] = b4[i] + c4[i];
  a5[i] = a[i] + a4[i];
  d2[i] = a2[i] + c2[i];
  d3[i] = a3[i] + c3[i];
  d4[i] = a4[i] + c4[i];
  d5[i] = a[i] + a4[i];
  a[i] = a5[i] + b5[i] + a[i];

  c2[i] = a[i] + c[i];
  c3[i] = b5[i] * a5[i];
  c4[i] = a2[i] * a3[i];
  c5[i] = b5[i] * a2[i];
  c[i] = a[i] + c3[i];
  c2[i] = a[i] + c4[i];
  a5[i] = a[i] + a4[i];
  a[i] = a[i] + b5[i] + a[i] * a2[i] * a3[i] * a4[i]
  * a5[i] * c[i] * c2[i] * c3[i] * c4[i] * c5[i]
  * d[i] * d2[i] * d3[i] * d4[i] * d5[i];
}
}

Demo: https://godbolt.org/z/x1acoMxGT

You can see it will produce register spilling if you specify LMUL >= 4

Now, with --param=riscv-autovec-lmul=dynamic.

GCC is able to pick LMUL = 2 to optimized this case.

This feature is supported by linear scan based local live ranges analysis and
compute maximum live V_REGS in specific program point of the function to 
determine the VF/LMUL.

Note that this patch can well handle both SLP and non-SLP loop.

Currenty approach didn't consider the later instruction scheduler which may 
improve the register pressure.
In this case, we are conservatively applying smaller VF/LMUL. (Not sure whether 
we should support live range shrink for such corner case since we don't known 
whether it can improve performance a lot.)

gcc/ChangeLog:

* config/riscv/riscv-vector-costs.cc (get_last_live_range): New 
function.
(compute_nregs_for_mode): Ditto.
(live_range_conflict_p): Ditto.
(max_number_of_live_regs): Ditto.
(compute_lmul): Ditto.
(costs::prefer_new_lmul_p): Ditto.
(costs::better_main_loop_than_p): Ditto.
* config/riscv/riscv-vector-costs.h (struct stmt_point): New struct.
(struct var_live_range): Ditto.
(struct autovec_info): Ditto.
* config/riscv/t-riscv: Update makefile for COST model.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul-mixed-1.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul1-1.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul1-2.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul1-3.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul1-4.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul1-5.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul1-6.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul1-7.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul2-1.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul2-2.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul2-3.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul2-4.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul2-5.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul2-6.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-1.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-2.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-3.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-4.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-5.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-6.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-7.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-8.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-1.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-10.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-2.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-3.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-4.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-5.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-6.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-7.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-8.c: New test.
* 

Re: [PATCH V2] RISC-V: Support Dynamic LMUL Cost model

2023-09-06 Thread Robin Dapp via Gcc-patches
Hi Juzhe,

general remark upfront:  Please add function-level comments for all
functions.  This makes reading and reviewing much easier.  I had to sweep
back and forth quite a bit.

> +
> +static int
> +get_last_live_range (const vec _ranges, tree var)
> +{
> +  unsigned int ix;
> +  var_live_range *live_range;
> +  FOR_EACH_VEC_ELT_REVERSE (live_ranges, ix, live_range)
> +if (live_range->var == var)
> +  return ix;
> +  return -1;
> +}

>From reading the usage site of this function it looks like we could benefit
from having the live ranges be a hash_map as well?  That way we wouldn't
need to scan through the list every time.  Something like
hash_map>.  It looks like we only consider the range
end anyway.

> +   int index = get_last_live_range (live_ranges, var);

That way we could avoid some worst-case behavior here for pathological
inputs.

> +   if (index == -1)
> + {
> +   var_live_range range = {var, 0, point};
> +   live_ranges.safe_push (range);
> + }

Please add a comment that we assume the variable is live from the start
of this block. 

> +   else
> + live_ranges[index].end = point;

And here a comment that we will grow the live range for each use.

> +static bool
> +live_range_conflict_p (const var_live_range _range1,
> +const var_live_range _range2)
> +{
> +  if (live_range1.start >= live_range2.end)
> +return false;
> +  if (live_range1.end <= live_range2.start)
> +return false;
> +  if (live_range2.start >= live_range1.end)
> +return false;
> +  if (live_range2.end <= live_range1.start)
> +return false;
> +  return true;
> +}

Rename to live_range_overlap_p and simplify to
 return a.end >= b.start || b.end >= a.start;

> +
> +static unsigned int
> +max_number_of_live_regs (const basic_block bb,
> +  const vec _ranges,
> +  machine_mode biggest_mode, int lmul)
> +{
> +  unsigned int max_nregs = 0;
> +  unsigned int i, j, k;
> +  unsigned int live_point = 0;
> +  for (i = 0; i < live_ranges.length (); i++)
> +{
> +  auto_vec conflict_live_ranges;
> +  var_live_range live_range = live_ranges[i];
> +  conflict_live_ranges.safe_push (live_range);
> +  unsigned int min_point = live_range.start;
> +  unsigned int max_point = live_range.end;
> +  for (j = 0; j < live_ranges.length (); j++)
> + {
> +   if (j == i)
> + continue;
> +   if (live_range_conflict_p (live_range, live_ranges[j]))
> + {
> +   conflict_live_ranges.safe_push (live_ranges[j]);
> +   min_point
> + = std::min (min_point, (unsigned int) live_ranges[j].start);
> +   max_point
> + = std::max (max_point, (unsigned int) live_ranges[j].end);
> + }
> + }
> +  for (j = min_point; j <= max_point; j++)
> + {
> +   unsigned int nregs = 0;
> +   for (k = 0; k < conflict_live_ranges.length (); k++)
> + {
> +   if (j >= (unsigned int) conflict_live_ranges[k].start
> +   && j <= (unsigned int) conflict_live_ranges[k].end)
> + {
> +   machine_mode mode
> + = TYPE_MODE (TREE_TYPE (conflict_live_ranges[k].var));
> +   nregs += compute_nregs_for_mode (mode, biggest_mode, lmul);
> + }
> + }
> +   if (nregs > max_nregs)
> + {
> +   max_nregs = nregs;
> +   live_point = j;
> + }
> + }
> +}

This looks pretty quadratic in the number of live ranges (or even cubic?).
Can't it be done more efficiently using a sliding-window approach by sorting
the live ranges according to their start point before?
Also std::min/max -> MIN/MAX.

> +
> +  /* Collect user explicit RVV type.  */
> +  hash_set all_preds = get_all_predecessors (bb);
> +  hash_set all_succs = get_all_successors (bb);

As mentioned before, maybe dominator info could help here?

> +  for (i = 0; i < cfun->gimple_df->ssa_names->length (); i++)
> +{
> +  tree t = ssa_name (i);
> +  if (!t)
> + continue;
> +  machine_mode mode = TYPE_MODE (TREE_TYPE (t));
> +  if (!lookup_vector_type_attribute (TREE_TYPE (t))
> +   && !riscv_v_ext_vls_mode_p (mode))
> + continue;
> +
> +  gimple *def = SSA_NAME_DEF_STMT (t);
> +  if (gimple_bb (def) && !all_preds.contains (gimple_bb (def)))
> + continue;
> +  const ssa_use_operand_t *const head = &(SSA_NAME_IMM_USE_NODE (t));
> +  const ssa_use_operand_t *ptr;
> +
> +  for (ptr = head->next; ptr != head; ptr = ptr->next)
> + {
> +   if (USE_STMT (ptr) && !is_gimple_debug (USE_STMT (ptr)))
> + {
> +   if (all_succs.contains (gimple_bb (USE_STMT (ptr
> + {

Reverse the conditions and continue, i.e. if (!USE_STMT || is_gimple_debug
 || !all_succs.contains).

> +
> +static int
> 

Re: [PATCH V2] RISC-V: Support Dynamic LMUL Cost model

2023-09-05 Thread Jeff Law via Gcc-patches




On 9/5/23 15:39, 钟居哲 wrote:

- Why don't we use the normal reverse postorder (or postorder) approach of
    computing live ranges?  Is that because we don't really need full global
    live ranges?

Yes. We don't need global live ranges.

- Why can't we use existing code i.e. tree-ssa-live?  I suspect I already
    know the answer but an explanation (in a comment) would still be useful.

The existing code can't help I have tried many times.
I would expect it to be fairly hard to use for this purpose.  I've tried 
to use it in other contexts as well without success.


Jeff


Re: Re: [PATCH V2] RISC-V: Support Dynamic LMUL Cost model

2023-09-05 Thread 钟居哲
- Why don't we use the normal reverse postorder (or postorder) approach of
   computing live ranges?  Is that because we don't really need full global
   live ranges?

Yes. We don't need global live ranges.

- Why can't we use existing code i.e. tree-ssa-live?  I suspect I already
   know the answer but an explanation (in a comment) would still be useful.

The existing code can't help I have tried many times.

- Do we really need get_all_predecessors/get_all_successors?  As they're
   only used for "defined before" and "used after", at first glance it
   looks like some kind of dominance info could help there but I didn't
   really check in detail.

Yes. At the first time, I want to use dominance analysis but I am not sure 
whether
we can use df_analyze () in COST model framwork. It worth trying.

- Why don't we use bitmaps/sbitmaps like in vsetvl.cc and other related
   passes?  I don't mind maps but just wonder if it's on purpose, for
   convenience or something else.

I don't know how to use bitmap to substitue the current approach of using map.

Besides, it might help to rename program_points_map (into program_points_per_bb
or so).  At first it looked quadratic to me but we're just iterating over
the program points of a BB.

Ok.


juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2023-09-06 05:02
To: Juzhe-Zhong; gcc-patches
CC: rdapp.gcc; kito.cheng; kito.cheng; jeffreyalaw
Subject: Re: [PATCH V2] RISC-V: Support Dynamic LMUL Cost model
Hi Juzhe,
 
I think the general approach makes sense and it doesn't need to be perfect
from the beginning as we can always iterate on it.  Before continuing with a
more detailed review (hopefully tomorrow) some high-level questions upfront.
It would help to document some of these choices so it's easier to understand
the rationale.
 
- Why don't we use the normal reverse postorder (or postorder) approach of
   computing live ranges?  Is that because we don't really need full global
   live ranges?
 
- Why can't we use existing code i.e. tree-ssa-live?  I suspect I already
   know the answer but an explanation (in a comment) would still be useful.
 
- Do we really need get_all_predecessors/get_all_successors?  As they're
   only used for "defined before" and "used after", at first glance it
   looks like some kind of dominance info could help there but I didn't
   really check in detail.
 
- Why don't we use bitmaps/sbitmaps like in vsetvl.cc and other related
   passes?  I don't mind maps but just wonder if it's on purpose, for
   convenience or something else. 
 
Besides, it might help to rename program_points_map (into program_points_per_bb
or so).  At first it looked quadratic to me but we're just iterating over
the program points of a BB.
 
Regards
Robin
 
 


Re: [PATCH V2] RISC-V: Support Dynamic LMUL Cost model

2023-09-05 Thread Robin Dapp via Gcc-patches
Hi Juzhe,

I think the general approach makes sense and it doesn't need to be perfect
from the beginning as we can always iterate on it.  Before continuing with a
more detailed review (hopefully tomorrow) some high-level questions upfront.
It would help to document some of these choices so it's easier to understand
the rationale.

 - Why don't we use the normal reverse postorder (or postorder) approach of
   computing live ranges?  Is that because we don't really need full global
   live ranges?

 - Why can't we use existing code i.e. tree-ssa-live?  I suspect I already
   know the answer but an explanation (in a comment) would still be useful.

 - Do we really need get_all_predecessors/get_all_successors?  As they're
   only used for "defined before" and "used after", at first glance it
   looks like some kind of dominance info could help there but I didn't
   really check in detail.

 - Why don't we use bitmaps/sbitmaps like in vsetvl.cc and other related
   passes?  I don't mind maps but just wonder if it's on purpose, for
   convenience or something else. 

Besides, it might help to rename program_points_map (into program_points_per_bb
or so).  At first it looked quadratic to me but we're just iterating over
the program points of a BB.

Regards
 Robin



[PATCH V2] RISC-V: Support Dynamic LMUL Cost model

2023-09-05 Thread Juzhe-Zhong
This patch support dynamic LMUL cost modeling with 
--param=riscv-autovec-lmul=dynamic.

Consider this following case:
void
foo (int32_t *__restrict a, int32_t *__restrict b,int32_t *__restrict c,
  int32_t *__restrict a2, int32_t *__restrict b2, int32_t *__restrict c2,
  int32_t *__restrict a3, int32_t *__restrict b3, int32_t *__restrict c3,
  int32_t *__restrict a4, int32_t *__restrict b4, int32_t *__restrict c4,
  int32_t *__restrict a5, int32_t *__restrict b5, int32_t *__restrict c5,
  int32_t *__restrict d,
  int32_t *__restrict d2,
  int32_t *__restrict d3,
  int32_t *__restrict d4,
  int32_t *__restrict d5,
  int n)
{
  for (int i = 0; i < n; i++)
{
  a[i] = b[i] + c[i];
  b5[i] = b[i] + c[i];
  a2[i] = b2[i] + c2[i];
  a3[i] = b3[i] + c3[i];
  a4[i] = b4[i] + c4[i];
  a5[i] = a[i] + a4[i];
  d2[i] = a2[i] + c2[i];
  d3[i] = a3[i] + c3[i];
  d4[i] = a4[i] + c4[i];
  d5[i] = a[i] + a4[i];
  a[i] = a5[i] + b5[i] + a[i];

  c2[i] = a[i] + c[i];
  c3[i] = b5[i] * a5[i];
  c4[i] = a2[i] * a3[i];
  c5[i] = b5[i] * a2[i];
  c[i] = a[i] + c3[i];
  c2[i] = a[i] + c4[i];
  a5[i] = a[i] + a4[i];
  a[i] = a[i] + b5[i] + a[i] * a2[i] * a3[i] * a4[i]
  * a5[i] * c[i] * c2[i] * c3[i] * c4[i] * c5[i]
  * d[i] * d2[i] * d3[i] * d4[i] * d5[i];
}
}

Demo: https://godbolt.org/z/x1acoMxGT

You can see it will produce register spilling if you specify LMUL >= 4

Now, with --param=riscv-autovec-lmul=dynamic.

GCC is able to pick LMUL = 2 to optimized this case.

This feature is supported by linear scan based local live ranges analysis and
compute maximum live V_REGS in specific program point of the function to 
determine the VF/LMUL.

Note that this patch can well handle both SLP and non-SLP loop.

Currenty approach didn't consider the later instruction scheduler which may 
improve the register pressure.
In this case, we are conservatively applying smaller VF/LMUL. (Not sure whether 
we should support live range shrink for such corner case since we don't known 
whether it can improve performance a lot.)

gcc/ChangeLog:

* config/riscv/riscv-vector-costs.cc (get_last_live_range): New 
function.
(compute_nregs_for_mode): Ditto.
(live_range_conflict_p): Ditto.
(max_number_of_live_regs): Ditto.
(compute_lmul): Ditto.
(costs::prefer_new_lmul_p): Ditto.
(costs::better_main_loop_than_p): Ditto.
* config/riscv/riscv-vector-costs.h (struct stmt_point): New struct.
(struct var_live_range): Ditto.
(struct autovec_info): Ditto.
* config/riscv/t-riscv: Update makefile for COST model.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul-mixed-1.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul1-1.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul1-2.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul1-3.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul1-4.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul1-5.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul1-6.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul1-7.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul2-1.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul2-2.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul2-3.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul2-4.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-1.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-2.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-3.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-4.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-5.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-6.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-7.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-8.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-1.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-10.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-2.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-3.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-4.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-5.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-6.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-7.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-8.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-9.c: New test.
* gcc.dg/vect/costmodel/riscv/rvv/rvv-costmodel-vect.exp: New test.

---