>> We might also want to use a polynomial right away, with two or three
>> predefined? Quadratic seems crazy, though, not sure how or why
>> hardware would do that :)
>
> Quadratic is chosen so that higher LMULs are penalized more than lower LMULs.
> When a loop has a low number of iterations (say, 6) at runtime, and the
> vectorized loop only iterates once for LMUL=1, the higher the LMUL, the slower
> the code. The current comparison logic compares the average cost per
> iteration, meaning that when we use a linear cost, for vectorized code that
> only differs in the LMUL selection, LMUL=1 and LMUL=8 have the same cost,
> because the LMUL factor appears in both the dividend and the divisor. Thus,
> I use a quadratic LMUL factor (that is, the cost is multiplied by LMUL
> again) here to show the actual cost in such "worst-case" scenarios. For
> targets that have a linear LMUL cost, this selection will not be VERY bad
> even if the number of iterations is high. When the number of iterations is
> low, this is beneficial. This is a somewhat extreme worst-case choice,
> though, and something between linear and quadratic may be desired. It's
> purely heuristic, and not that related to the actual hardware instruction
> cost (which is assumed linear throughout my discussion).
>
> I think I got the VLS part wrong, though. For VLS vectorization, the
> remaining part goes to the epilogue, so for the VLS-vectorized part the cost
> should indeed be linear.
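
To put numbers on the quoted argument about the per-iteration average, here
is a toy C++ sketch. It is not GCC code; the constants and function names are
made up for illustration. It shows how the LMUL factor cancels with a linear
cost and survives with a quadratic one:

```cpp
// Toy model of the cost comparison; all names and numbers are hypothetical.
constexpr int body_cost_m1 = 10;  // assumed cost of one vector iteration at LMUL=1
constexpr int elems_m1 = 8;       // assumed elements per vector iteration at LMUL=1

// Cost of one vector iteration when the cost scales linearly with LMUL.
constexpr int linear_iter_cost (int lmul)
{
  return body_cost_m1 * lmul;
}

// Same, with the cost multiplied by LMUL once more (the quadratic factor).
constexpr int quadratic_iter_cost (int lmul)
{
  return body_cost_m1 * lmul * lmul;
}

// Average cost per scalar iteration: the iteration cost (dividend) over the
// number of elements handled per vector iteration (divisor).
constexpr double per_scalar_iter (int iter_cost, int lmul)
{
  return static_cast<double> (iter_cost) / (elems_m1 * lmul);
}
```

With these assumed numbers the linear model gives 1.25 per scalar iteration
for both LMUL=1 and LMUL=8, whereas the quadratic model gives 1.25 and 10.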

Right now the heuristic seems very much geared towards making that specific
function in x264 run fast, and just making something quadratic doesn't exactly
counter that argument :)

The major tension we have is that the vectorizer doesn't emit a runtime
iteration check for partial vectorization like RVV's. The assumption is that
length masking is always cheap. That assumption does not hold when the latency
depends only on the vector size (=LMUL). Therefore, another approach could be
to either disable partial vectorization altogether if
vl_dependent_lmul_scaling == true (very likely too big a hammer; you can check
how --param=vect-partial-vector-usage=0 works for you) or define a new channel
to let the vectorizer know we _do_ want a runtime check despite partial
vectorization under specific circumstances.

Generally, a heuristic that might be reasonable could be "If the loop is
length controlled (with compile-time unknown niters) and latency depends on
LMUL rather than VL, try to be less aggressive on LMUL."

The rationale would be something like: "Assuming a standard distribution of
VL around a 'normal' value, small VLs with high LMUL cause disproportionately
high latency that is not amortized by the speedup we get from large VL with
high LMUL."
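
As a rough illustration of that rationale, here is a toy latency model in
C++. Everything in it is an assumption: the constants are invented, a
per-iteration overhead is added so that high LMUL can win at all, and trip
counts are taken to be uniformly distributed rather than clustered around a
typical value.

```cpp
// Toy latency model; every constant below is an assumption for illustration.
constexpr int elems_m1 = 8;       // elements per vector iteration at LMUL=1
constexpr int body_cost_m1 = 10;  // vector body latency at LMUL=1
constexpr int loop_overhead = 4;  // per-iteration overhead independent of LMUL

// Latency of a length-controlled loop when latency depends on LMUL only,
// i.e. a short final iteration costs as much as a full one.
int loop_latency (int niters, int lmul)
{
  const int elems = elems_m1 * lmul;
  const int vec_iters = (niters + elems - 1) / elems;  // ceiling division
  return vec_iters * (body_cost_m1 * lmul + loop_overhead);
}

// Sum of the latencies over trip counts 1..max_niters, which corresponds to
// a uniform distribution of trip counts.
int total_latency (int max_niters, int lmul)
{
  int sum = 0;
  for (int n = 1; n <= max_niters; ++n)
    sum += loop_latency (n, lmul);
  return sum;
}
```

With these made-up constants a 6-iteration loop costs 14 at LMUL=1 and 84 at
LMUL=8, while a 64-iteration loop costs 112 and 84. Summed over all trip
counts from 1 to 64 that is 4032 vs. 5376, so in this model the win at the
high end does not make up for the loss at the low end.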

That's still pretty shaky and we'd need to see if people consider this
"benchmark hacking" or not.

Anyway, IMHO you want something like:

  if (LOOP_VINFO_FULLY_WITH_LENGTH_P (...) && vl_dependent_lmul_scaling
      && niter...)
    lmul_factor = scale_lmul (...);

I'm not sure whether we want to expose this as a param. Maybe?
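
A self-contained sketch of that condition, outside of GCC, could look like
the following. The structs are hypothetical stand-ins for the vectorizer and
tuning state, the niters test is my reading of "compile-time unknown niters"
from the heuristic above, and scale_lmul is only a placeholder that returns
the extra LMUL factor.

```cpp
// Hypothetical stand-in for the loop state; not GCC's real data structures.
struct loop_info
{
  bool fully_with_length;  // length controlled, cf. LOOP_VINFO_FULLY_WITH_LENGTH_P
  bool niters_known;       // trip count is known at compile time
};

// Hypothetical tuning description. The flag name comes from this thread; it
// is assumed to be set for targets where the LMUL penalty should apply.
struct tune_info
{
  bool vl_dependent_lmul_scaling;
};

// Placeholder scaling: multiply by LMUL once more. Something between linear
// and quadratic could be substituted here.
int scale_lmul (int lmul)
{
  return lmul;
}

// Extra factor applied to the vector cost; 1 means "no penalty".
int lmul_factor (const loop_info &loop, const tune_info &tune, int lmul)
{
  if (loop.fully_with_length && tune.vl_dependent_lmul_scaling
      && !loop.niters_known)
    return scale_lmul (lmul);
  return 1;
}
```

In this sketch a loop with a known trip count, or one that is not length
controlled, keeps a factor of 1 regardless of LMUL.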

Also, regarding costing: Right now we don't factor in scalar units. What we
actually should do is additionally scale the vector costs by the scalar
ILP/units (or rather the ratio #scalar units/#vector units). Maybe that will
help a bit here?
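
Just to make the ratio concrete, a minimal sketch with invented names (this
is not an existing GCC interface, and the rounding choice is arbitrary):

```cpp
// Hypothetical description of a core's execution resources.
struct core_units
{
  int scalar_units;  // number of scalar execution units
  int vector_units;  // number of vector execution units (must be > 0)
};

// Scale a vector cost by the ratio #scalar units / #vector units.
// The division rounds up so that a nonzero cost never scales down to zero.
int scale_vector_cost (int vector_cost, const core_units &core)
{
  return (vector_cost * core.scalar_units + core.vector_units - 1)
	 / core.vector_units;
}
```

For example, with four scalar units and one vector unit a vector cost of 10
becomes 40, while a core with equally many scalar and vector units leaves the
cost unchanged.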

Still, I think the most straightforward approach for this loop is LTO and "just
knowing" the number of iterations.

--
Regards
Robin