> We might also want to use a polynomial right away, with two or three
> predefined?  Quadratic seems crazy, though, not sure how or why hardware
> would do that :)

Quadratic is chosen so that higher LMULs are penalized more than lower LMULs.
When a loop has a low number of iterations at runtime (say, 6), and the
vectorized loop iterates only once even for LMUL=1, the higher the LMUL, the
slower the code.  The current comparison logic compares the average cost per
iteration, so with a linear cost, vectorized code that differs only in the
LMUL selection gets the same cost for LMUL=1 and LMUL=8, because the LMUL
factor appears in both the dividend and the divisor.  Thus, I use a quadratic
LMUL factor (that is, the cost is multiplied by LMUL again) here to reflect the
actual cost in such "worst-case" scenarios.  For targets that have a linear LMUL
cost, this choice will not be very bad even if the number of iterations is
high, and when the number of iterations is low, it is beneficial.  This is a
somewhat extreme worst-case choice, though, and something between linear and
quadratic may be desirable.  It's purely heuristic and not closely related to
the actual hardware instruction cost (which is assumed linear throughout my
discussion).
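
To make the arithmetic concrete, here is a rough sketch of the comparison.
This is a toy model, not the actual cost model code: the names base, vf1
(elements handled per vector iteration at LMUL=1) and exponent are made up
for illustration, and the hardware cost is assumed to be linear in LMUL.

```python
# Toy model only; base, vf1 and exponent are hypothetical names.

def body_cost(base, lmul, exponent):
    """Modelled cost of one vector loop iteration: base * LMUL^exponent."""
    return base * lmul ** exponent

def modelled_cost_per_iter(base, lmul, vf1, exponent):
    """Average cost per scalar iteration, as used for the comparison."""
    vf = vf1 * lmul  # scalar iterations covered by one vector iteration
    return body_cost(base, lmul, exponent) / vf

def actual_cost(base, lmul, vf1, niters):
    """Runtime cost of a length-controlled loop on linear-cost hardware."""
    vf = vf1 * lmul
    vec_iters = -(-niters // vf)  # ceiling division
    return vec_iters * body_cost(base, lmul, 1)

if __name__ == "__main__":
    for lmul in (1, 2, 4, 8):
        print(lmul,
              modelled_cost_per_iter(10, lmul, 8, 1),  # linear model
              modelled_cost_per_iter(10, lmul, 8, 2),  # quadratic model
              actual_cost(10, lmul, 8, 6),             # 6 iterations
              actual_cost(10, lmul, 8, 6400))          # many iterations
```

With the linear model the per-iteration cost is identical for every LMUL,
so the comparison cannot tell them apart.  The quadratic model grows with
LMUL in the same way as the actual cost of the 6-iteration case, while for
6400 iterations the actual cost is the same for every LMUL.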

I think I got the VLS part wrong, though.  For VLS vectorization, the
remaining iterations go to the epilogue, so for the VLS-vectorized part the
cost should indeed be linear.
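
A small sketch of why, under the same assumptions as before (linear hardware
cost; base and vf1 are hypothetical names, with vf1 being the number of
elements per vector iteration at LMUL=1):

```python
# Toy model only; base and vf1 are hypothetical names.

def vls_split(niters, vf):
    """Split scalar iterations into (main-loop vector iterations, epilogue)."""
    return niters // vf, niters % vf

def vls_main_cost_per_element(base, lmul, vf1):
    """Cost per element actually handled by the VLS main loop."""
    vf = vf1 * lmul
    return base * lmul / vf  # the LMUL factor cancels

if __name__ == "__main__":
    print(vls_split(6, 64))    # main loop does not run, all 6 go to epilogue
    print(vls_split(100, 8))   # 12 full vector iterations, 4 left over
    print(vls_main_cost_per_element(10, 1, 8),
          vls_main_cost_per_element(10, 8, 8))
```

Since the main loop only ever runs full vector iterations, no work is wasted
on a partially filled vector, and the per-element cost of that part does not
depend on LMUL.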
