>> We might also want to use a polynomial right away, with two or three
>> predefined?  Quadratic seems crazy, though, not sure how or why hardware
>> would do that :)
>
> Quadratic is chosen so that higher LMULs are penalized more than lower LMULs.
> When a loop has a low number of iterations (say, 6) at runtime, and the
> vectorized loop only iterates once for LMUL=1, the higher the LMUL, the slower
> the code.  The current comparison logic compares the average cost per
> iteration, meaning that when we use a linear cost, for vectorized code that
> only differs in the LMUL selection, LMUL=1 and LMUL=8 have the same cost,
> because the LMUL factor appears both in the dividend and the divisor.  Thus,
> I use a quadratic LMUL factor (that is, the cost is multiplied by LMUL again)
> here to show the actual cost in such "worst-case" scenarios.  For targets
> that have a linear LMUL cost, this selection will not be VERY bad even if the
> number of iterations is high.  When the number of iterations is low, this is
> beneficial.  This is a somewhat extreme worst-case choice, though, and
> something between linear and quadratic may be desired.  It's purely
> heuristic, and not that related to the actual hardware instruction cost
> (which is assumed linear throughout my discussion).
>
> I think I got the VLS part wrong, though.  For VLS vectorization, the
> remaining part goes to the epilogue, so for the VLS-vectorized part the cost
> should indeed be linear.

Right now the heuristic seems very much geared towards making that specific 
function in x264 run fast, and just making something quadratic doesn't exactly 
counter that argument :)
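
To make the quoted argument concrete, here is a minimal sketch of why a linear
LMUL factor cancels out of a per-iteration comparison while a quadratic one
does not.  The function and all parameter names are made up for illustration;
this is not the vectorizer's actual costing code.

```cpp
#include <cassert>

// Hypothetical model, not GCC's actual costing code.
// body_cost:     cost of one vector loop body at LMUL=1
// elems_per_reg: elements processed per vector iteration at LMUL=1
// lmul:          register group multiplier (1, 2, 4 or 8)
// exponent:      1 for a linear LMUL factor, 2 for a quadratic one
double
cost_per_scalar_iter (double body_cost, unsigned elems_per_reg,
                      unsigned lmul, unsigned exponent)
{
  assert (elems_per_reg > 0 && lmul > 0);

  // Scale the body cost by LMUL^exponent.
  double factor = 1.0;
  for (unsigned i = 0; i < exponent; i++)
    factor *= lmul;

  // The vectorization factor (the divisor) grows linearly with LMUL.
  double vf = static_cast<double> (elems_per_reg) * lmul;
  return body_cost * factor / vf;
}
```

With body_cost = 10 and elems_per_reg = 4, the linear factor gives 2.5 for both
LMUL=1 and LMUL=8, so the comparison cannot tell them apart.  The quadratic
factor gives 2.5 for LMUL=1 and 20 for LMUL=8.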

The major tension we have is that the vectorizer doesn't emit a runtime 
iteration check for partial vectorization like RVV's.  The assumption is that 
length masking is always cheap.  That assumption does not hold when the latency 
is just dependent on the vector size (=LMUL).  Therefore, another approach 
could be to either disable partial vectorization altogether (you can check how
--param=vect-partial-vector-usage=0 works for you) if
vl_dependent_lmul_scaling == true (very likely too big a hammer) or define a
new channel to let the vectorizer know we _do_ want a runtime check despite 
partial vectorization under specific circumstances.

Generally, a heuristic that might be reasonable could be "If the loop is
length controlled (with compile-time unknown niters) and latency depends on 
LMUL rather than VL, try to be less aggressive on LMUL.".

The rationale would be something like: "Assuming a standard distribution of
VL around a 'normal' value, small VLs with high LMUL cause disproportionally
high latency that is not amortized by the speedup we get from large VL with 
high LMUL."
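
A toy model of that rationale, under the assumption that each vector iteration
costs LMUL units regardless of VL, plus a fixed per-iteration loop overhead.
The function, its parameters and the numbers below are hypothetical and only
meant to show the shape of the trade-off, not measured hardware behavior.

```cpp
#include <cassert>

// Hypothetical latency model: a vector iteration costs `lmul` units no
// matter how small VL is, plus `overhead` units of loop bookkeeping.
unsigned
loop_latency (unsigned niters, unsigned elems_per_reg, unsigned lmul,
              unsigned overhead)
{
  assert (elems_per_reg > 0 && lmul > 0);

  // Elements covered by one length-controlled vector iteration.
  unsigned vf = elems_per_reg * lmul;

  // Number of vector iterations, rounded up (the last one is partial).
  unsigned vec_iters = (niters + vf - 1) / vf;

  return vec_iters * (lmul + overhead);
}
```

With elems_per_reg = 8 and overhead = 2, a 6-iteration loop costs 3 units at
LMUL=1 but 10 at LMUL=8, while a 512-iteration loop costs 192 at LMUL=1 and 80
at LMUL=8.  In this model the short loop loses more (about 3.3x) than the long
loop gains (2.4x).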

That's still pretty shaky and we'd need to see if people consider this 
"benchmark hacking" or not.

Anyway, IMHO you want something like:
 if (LOOP_VINFO_FULLY_WITH_LENGTH_P (...) && vl_dependent_lmul_scaling
     && niter...)
  lmul_factor = scale_lmul (...);

I'm not sure we want to expose this as a param, maybe?

Also, regarding costing:  Right now we don't factor in scalar units.  What we 
actually should do is additionally scale the vector costs by the scalar 
ILP/units (or rather the ratio #scalar units/#vector units).  Maybe that will 
help a bit here?
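
As a rough sketch of what I mean by that scaling (the helper and its
parameters are hypothetical, nothing like this exists in the cost model
today):

```cpp
#include <cassert>

// Hypothetical helper, not an existing GCC function.  Scales a vector
// cost by the ratio of scalar to vector execution units so that a core
// with many scalar units and few vector units sees vector code as
// relatively more expensive.
double
scale_vector_cost (double vector_cost, unsigned scalar_units,
                   unsigned vector_units)
{
  assert (vector_units > 0);
  return vector_cost * scalar_units / vector_units;
}
```

For example, with four scalar units and one vector unit a vector cost of 10
would become 40, whereas with equal unit counts it stays at 10.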

Still, I think the most straightforward approach for this loop is LTO and "just 
knowing" the number of iterations.

-- 
Regards
 Robin
