> Quadratic is chosen so that higher LMULs are penalized more than lower LMULs.
> When a loop has a low number of iterations (say, 6) at runtime, and the
> vectorized loop only iterates once for LMUL=1,
> the higher the LMUL, the slower the code.

That's not true for all cores. SiFive cores are implemented as `Olvt`, so VL=1 results in the same latency for both LMUL=1 and LMUL=8. I am not opposed to adding this as a new parameter, but I do oppose making it the default. It should be disabled by default and enabled only for cores whose owners explicitly confirm that this model is appropriate.
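
To make the opt-in concrete, here is a rough sketch of what I have in mind. All names (`TuneInfo`, `QuadraticLMULPenalty`, `lmulCost`) are hypothetical and not existing options, and the linear fallback is only an assumption for illustration, not a claim about what the current model does.

```cpp
#include <cstdint>

// Hypothetical per-core tuning knob; not an existing option.
struct TuneInfo {
  // Off by default; a core's owner would have to opt in explicitly.
  bool QuadraticLMULPenalty = false;
};

// Hypothetical cost helper. LMUL is expected to be 1, 2, 4, or 8.
// The linear fallback is an assumption made for this sketch only.
inline uint64_t lmulCost(const TuneInfo &Tune, uint64_t BaseCost,
                         uint64_t LMUL) {
  if (Tune.QuadraticLMULPenalty)
    return BaseCost * LMUL * LMUL; // opt-in: penalize high LMUL more steeply
  return BaseCost * LMUL;          // default: penalty stays disabled
}
```

With this shape, a core that has not opted in sees no change in behavior, and the quadratic penalty only applies where it has been confirmed to match the hardware.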
