> We might also want to use a polynomial right away, with two or three
> predefined? Quadratic seems crazy, though, not sure how or why hardware
> would do that :)
Quadratic is chosen so that higher LMULs are penalized more than lower LMULs. When a loop has a low number of iterations at runtime (say, 6), and the vectorized loop already iterates only once at LMUL=1, then the higher the LMUL, the slower the code. The current comparison logic compares the average cost per iteration, which means that with a linear cost, vectorized code that differs only in the LMUL selection has the same cost at LMUL=1 and LMUL=8, because the LMUL factor appears in both the dividend and the divisor. Thus, I use a quadratic LMUL factor here (that is, the cost is multiplied by LMUL again) to show the actual cost in such "worst-case" scenarios.

For targets that have a linear LMUL cost, this selection will not be very bad even if the number of iterations is high, and when the number of iterations is low, it is beneficial. This is a somewhat extreme worst-case choice, though, and something between linear and quadratic may be desirable. It's purely heuristic, and not that closely related to the actual hardware instruction cost (which is assumed to be linear throughout my discussion).

I think I got the VLS part wrong, though. For VLS vectorization, the remaining iterations go to the epilogue, so for the VLS-vectorized part the cost should indeed be linear.
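To make the cancellation argument concrete, here is a rough sketch in Python. It is a toy model, not the actual cost-model code: `BASE_VF`, `BASE_COST`, and all function names are made up for illustration, and the hardware is assumed to have a linear LMUL cost and a vector loop that handles a short trip count in a single iteration.

```python
from math import ceil

BASE_VF = 8     # hypothetical: elements per vector iteration at LMUL=1
BASE_COST = 10  # hypothetical: loop-body cost at LMUL=1, arbitrary units
LMULS = (1, 2, 4, 8)


def body_cost(lmul, exponent):
    """Modelled cost of one vector iteration: BASE_COST * lmul**exponent."""
    return BASE_COST * lmul ** exponent


def avg_cost_per_element(lmul, exponent):
    """Average cost per scalar iteration, the quantity being compared."""
    return body_cost(lmul, exponent) / (BASE_VF * lmul)


def actual_cost(n, lmul):
    """Total cost of n scalar iterations, assuming linear hardware cost."""
    vector_iters = ceil(n / (BASE_VF * lmul))
    return vector_iters * body_cost(lmul, 1)


# What the comparison sees under each model.
linear_avg = [avg_cost_per_element(m, 1) for m in LMULS]
quadratic_avg = [avg_cost_per_element(m, 2) for m in LMULS]

# What the assumed hardware would actually spend.
short_loop = [actual_cost(6, m) for m in LMULS]
long_loop = [actual_cost(1024, m) for m in LMULS]

print("linear avg:   ", linear_avg)
print("quadratic avg:", quadratic_avg)
print("n=6 actual:   ", short_loop)
print("n=1024 actual:", long_loop)
```

With these made-up numbers, the linear average is 1.25 for every LMUL, so the comparison cannot tell them apart, while the quadratic average is 1.25, 2.5, 5.0, 10.0. The actual cost for 6 iterations is 10, 20, 40, 80, which grows with LMUL in the same proportion as the quadratic average. For 1024 iterations the actual cost is 1280 at every LMUL, which is the case where the quadratic factor over-penalizes.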
