https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125476

--- Comment #3 from Robin Dapp <rdapp at gcc dot gnu.org> ---
(In reply to Zhongyao Chen from comment #2)
> I found another testcase. When I test with my local patch that scales
> RVV vector body costs by a default scalar/vector unit ratio of 2,
> I see many regressions locally (69 test files).
> 
> One example is `vx-5-i64.c`. It can be reduced to:
> 
> ```c
> #include <stdint.h>
> 
> void
> test_vx_binary_add_int64_t_case_1 (int64_t *restrict out,
>                                     int64_t *restrict in,
>                                     int64_t x, unsigned n)
> {
>   unsigned k = 0;
>   int64_t tmp = x + 3;
> 
>   while (k < n)
>     {
>       tmp = tmp ^ 0x3f;
>       out[k + 0] = in[k + 0] + tmp;
>       out[k + 1] = in[k + 1] + tmp;
>       k += 2;
>     }
> }
> ```
> 
> On trunk, this is SLP-vectorized.
> 
> The vector cost is 6:
> 
> - vector_load = 2
> - vector_stmt = 1
> - vector_store = 1
> - scalar_to_vec = 2
> 
> The scalar cost is also 6.
> 
> With trunk + my scalar/vector ratio patch, it falls back to scalar code
> because the vector cost becomes 7 while the scalar cost stays 6.
> 
> That does not look right to me. I still expect it to be vectorized.
> 
> I think the main issue is the scalar_to_vec cost. It looks overcounted here.
> 
> There is no explicit `vmv.v.x` or other standalone scalar-to-vector
> instruction in the final assembly.  Instead, it generates `vadd.vx v1,v1,a2`.
> 
> So SLP costing seems to charge a virtual scalar_to_vec cost
> even when the final lowering can use a .vx form directly.

Those vector-scalar variants are a bit difficult to cost early because we don't
really know if we're going to commit to the vx variant later.  That depends on
rtx costs (and other factors).  So in the vectorizer, we always include a
"scalar_to_vec" overhead.
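
To make the accounting concrete, here is a minimal sketch of how the quoted
cost components add up with and without the "scalar_to_vec" charge.  The
struct and function names are hypothetical and are not GCC's internal API.
The tie-breaking rule (equal cost still vectorizes) is an assumption inferred
from the numbers quoted above, where 6 vs. 6 is vectorized and 7 vs. 6 is not.

```c
#include <stdbool.h>

/* Hypothetical cost record; the fields mirror the cost kinds quoted above.  */
struct slp_cost
{
  int vector_load;
  int vector_stmt;
  int vector_store;
  int scalar_to_vec;
};

/* Sum all components into one vector-side cost.  */
static int
total_cost (const struct slp_cost *c)
{
  return c->vector_load + c->vector_stmt + c->vector_store
	 + c->scalar_to_vec;
}

/* Assumption: ties go to the vector side, matching the 6 vs. 6 case.  */
static bool
profitable_p (int vector_cost, int scalar_cost)
{
  return vector_cost <= scalar_cost;
}
```

With the quoted values { 2, 1, 1, 2 } the total is 6, which ties the scalar
cost of 6.  Dropping "scalar_to_vec" to 0 would give 4, so a vector cost of 7
after scaling is exactly the kind of case where that one component decides
the outcome.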

On many uarchs that's actually correct, though, and the vadd.vx will be slower,
in particular when tmp is on the critical path.  So IMHO not vectorizing is not
too unreasonable.  When costing is "on the brink", as in this contrived
example, a small change can flip the decision.  Just assume a vadd.vx costs
2 extra cycles: with 2 scalar ALUs you could squeeze 4 scalar operations into
those 2 cycles.
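
As a back-of-the-envelope sketch of that arithmetic (the function name and
the simple "one operation per ALU per cycle" model are assumptions for
illustration, not a description of any particular core):

```c
/* Number of scalar operations that fit into the extra latency of a slower
   vector-scalar instruction, assuming every scalar ALU retires one
   independent operation per cycle.  */
static unsigned
scalar_ops_in_window (unsigned extra_cycles, unsigned scalar_alus)
{
  return extra_cycles * scalar_alus;
}
```

For the numbers in this comment, scalar_ops_in_window (2, 2) gives 4.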

On uarchs that have "free" vx variants, vector costing is off, of course.
