I see, let the vec_dup enter the rtx_cost again to append the total to vmv, I 
have a try testing.  For example with below change:

+       switch (rcode)
+       {
+         case VEC_DUPLICATE:
+           *total += get_vector_costs ()->regmove->GR2VR * COSTS_N_INSNS (1);
+           break;
+         case PLUS:
+           {
+           rtx op_0 = XEXP (x, 0);     
+           rtx op_1 = XEXP (x, 1);
+           if (GET_CODE (op_0) == VEC_DUPLICATE
+               || GET_CODE (op_1) == VEC_DUPLICATE)
+             *total += get_vector_costs ()->regmove->GR2VR * COSTS_N_INSNS (1);
+           else
+             *total = COSTS_N_INSNS (1);
+           break;
+           }
+         default:
+           *total = COSTS_N_INSNS (1);
+           break;
+       }
+
+       return true;

For case_0, GR2VR is 0, we will have late-combine as blow:
  51   │ trying to combine definition of r135 in:
  52   │    11: r135:RVVM1SI=vec_duplicate(r150:DI#0)
  53   │ into:
  54   │    18: r147:RVVM1SI=r146:RVVM1SI+r135:RVVM1SI
  55   │       REG_DEAD r146:RVVM1SI
  56   │ successfully matched this instruction to *add_vx_rvvm1si:
  57   │ (set (reg:RVVM1SI 147 [ vect__6.8_16 ])
  58   │     (plus:RVVM1SI (vec_duplicate:RVVM1SI (subreg/s/u:SI (reg:DI 150 [ 
x ]) 0))
  59   │         (reg:RVVM1SI 146)))
  60   │ original cost = 8 + 4 (weighted: 39.483637), replacement cost = 8 
(weighted: 64.727273); rejecting replacement

For case_0, GR2VR is 1, we will have late-combine as blow:
  51   │ trying to combine definition of r135 in:
  52   │    11: r135:RVVM1SI=vec_duplicate(r150:DI#0)
  53   │ into:
  54   │    18: r147:RVVM1SI=r146:RVVM1SI+r135:RVVM1SI
  55   │       REG_DEAD r146:RVVM1SI
  56   │ successfully matched this instruction to *add_vx_rvvm1si:
  57   │ (set (reg:RVVM1SI 147 [ vect__6.8_16 ])
  58   │     (plus:RVVM1SI (vec_duplicate:RVVM1SI (subreg/s/u:SI (reg:DI 150 [ 
x ]) 0))
  59   │         (reg:RVVM1SI 146)))
  60   │ original cost = 12 + 4 (weighted: 43.043637), replacement cost = 12 
(weighted: 97.090910); rejecting replacement

For case_0, GR2VR is 2, we will have late-combine as blow:
  51   │ trying to combine definition of r135 in:
  52   │    11: r135:RVVM1SI=vec_duplicate(r150:DI#0)
  53   │ into:
  54   │    18: r147:RVVM1SI=r146:RVVM1SI+r135:RVVM1SI
  55   │       REG_DEAD r146:RVVM1SI
  56   │ successfully matched this instruction to *add_vx_rvvm1si:
  57   │ (set (reg:RVVM1SI 147 [ vect__6.8_16 ])
  58   │     (plus:RVVM1SI (vec_duplicate:RVVM1SI (subreg/s/u:SI (reg:DI 150 [ 
x ]) 0))
  59   │         (reg:RVVM1SI 146)))
  60   │ original cost = 16 + 4 (weighted: 46.603637), replacement cost = 16 
(weighted: 129.454546); rejecting replacement

The vadd v, vec_dup(x) seems has the same cost as vec_dup here. I am also 
confused about the how we calculate the
vadd v, vec_dup(x), can we just set its' cost to vadd.vx? given we have 
define_insn_and_split to match the pattern and
emit the vadd.vx directly. And it matches the expr we mentioned vadd.vv + vec 
== vadd.vx.
Please help to correct me if misunderstanding.

------------- cut line -------------

> Could you give a short example where that doesn't work so I can have a quick 
> look myself?

Sorry for misleading, it is not something doesn't work, just would like to make 
it clean before next step as it
is quit complicate up to a point.
However, I use below example for testing, case_0 for only vadd.vv in loop, 
while case_1 for vmv and vadd.vv in loop.

   1   │ #include <stdint.h>
   2   │
   3   │ #define T int32_t
   4   │ #define OP +
   5   │
   6   │ void
   7   │ test_vx_binary_case_0 (T * restrict out, T * restrict in, T x, 
unsigned n)
   8   │ {
   9   │   for (unsigned i = 0; i < n; i++)
  10   │     out[i] = in[i] OP x;
  11   │ }
  12   │
  13   │ void
  14   │ test_vx_binary_case_1 (T * restrict out, T * restrict a, T b, unsigned 
n)
  15   │ {
  16   │   unsigned k = 0;
  17   │
  18   │   T xb = b + 3;
  19   │
  20   │   while (k < n) {
  21   │     out[k + 0] = a[k + 0] OP xb;
  22   │     out[k + 1] = a[k + 1] OP xb;
  23   │     out[k + 2] = a[k + 2] OP xb;
  24   │     out[k + 3] = a[k + 3] OP xb;
  25   │     k += 4;
  26   │
  27   │     xb = xb ^ 0x3f;
  28   │
  29   │     out[k + 0] = a[k + 0] OP xb;
  30   │     out[k + 1] = a[k + 1] OP xb;
  31   │     out[k + 2] = a[k + 2] OP xb;
  32   │     out[k + 3] = a[k + 3] OP xb;
  33   │     k += 4;
  34   │   }
  35   │ }

Pan

-----Original Message-----
From: Robin Dapp <rdapp....@gmail.com> 
Sent: Tuesday, April 29, 2025 2:31 PM
To: Li, Pan2 <pan2...@intel.com>; Robin Dapp <rdapp....@gmail.com>; 
gcc-patches@gcc.gnu.org
Cc: juzhe.zh...@rivai.ai; kito.ch...@gmail.com; jeffreya...@gmail.com; Chen, 
Ken <ken.c...@intel.com>; Liu, Hongtao <hongtao....@intel.com>; Robin Dapp 
<rdapp....@gmail.com>
Subject: Re: [PATCH v2 1/3] RISC-V: Combine vec_duplicate + vadd.vv to vadd.vx 
on GR2VR cost

> But this is not that good enough here if my understanding is correct.
> As vmv.v.x is somehow equivalent to vec_dup but doesn't ref GR2VR,

But it should.  Can't we do something like:

  if (riscv_v_ext_mode_p (mode))
    {
      switch (GET_CODE (x))
        {
          case VEC_DUPLICATE:
            // will add 0, 4, 8 in case of GR2VR = 0, 1, 2
            *total += GR2VR * COST_N_INSNS (1);
            break;
          case PLUS:
            rtx op2 = XEXP (XEXP (x, 1), 0);
            if (GET_CODE (op2) == VEC_DUPLICATE)
              *total += GR2VR * COST_N_INSNS (1);
            ...
            break;


Could you give a short example where that doesn't work so I can have a quick 
look myself?

Thanks.

-- 
Regards
 Robin

Reply via email to