> Quadratic is chosen so that higher LMULs are penalized more than lower LMULs.
> When a loop has a low number of iterations (say, 6) at runtime, and the
> vectorized loop only iterates once for LMUL=1,
> the higher the LMUL,  the slower the code.

That's not true for all cores. SiFive cores are implemented as `Olvt`, so
VL=1 results in the same latency for both LMUL=1 and LMUL=8.

I am not opposed to adding this as a new parameter, but I do oppose making it
the default. It should be disabled by default and enabled only for cores whose
owners explicitly confirm that this model is appropriate.
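
To make the proposal concrete, here is a minimal sketch of what an opt-in knob could look like. The names `LMULCostParams`, `QuadraticLMULPenalty`, and `lmulScaledCost` are hypothetical and not existing LLVM APIs, and the linear fallback is an assumption about the current behaviour rather than a description of the actual cost model.

```cpp
#include <cstdint>

// Hypothetical per-core tuning parameters (illustrative only, not an LLVM API).
struct LMULCostParams {
  // Off by default; a core would opt in only once its owners confirm
  // that the quadratic model matches the hardware.
  bool QuadraticLMULPenalty = false;
};

// BaseCost is the cost at LMUL=1; Lmul is expected to be 1, 2, 4, or 8.
// Assumption: without the knob, cost scales linearly with LMUL.
inline uint64_t lmulScaledCost(uint64_t BaseCost, unsigned Lmul,
                               const LMULCostParams &Params) {
  if (Params.QuadraticLMULPenalty)
    return BaseCost * Lmul * Lmul; // penalize higher LMULs more heavily
  return BaseCost * Lmul;          // default: behaviour unchanged
}
```

With this shape, cores that do not set the flag keep their current costs, and the quadratic penalty only applies where it has been explicitly requested.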
