On Tue, May 19, 2026 at 5:44 PM Robin Dapp <[email protected]> wrote:
>
> > The major tension we have is that the vectorizer doesn't emit a runtime
> > iteration check for partial vectorization like RVV's. The assumption is
> > that length masking is always cheap.

Btw, it's still my intention to "fix" this. On x86 this issue is present
for a small number of actual scalar iterations.

> > That assumption does not hold when the latency is just dependent on the
> > vector size (=LMUL). Therefore, another approach could be to either
> > disable partial vectorization altogether (you can see how
> > --param=vect-partial-vector-usage=0 works for you) if
> > vl_dependent_lmul_scaling == true (very likely too big a hammer) or
> > define a new channel to let the vectorizer know we _do_ want a runtime
> > check despite partial vectorization under specific circumstances.
> >
> > Generally, a heuristic that might be reasonable could be "If the loop is
> > length controlled (with compile-time unknown niters) and latency depends
> > on LMUL rather than VL, try to be less aggressive on LMUL.".
> >
> > The rationale would be something like: "Assuming a standard distribution
> > of VL around a 'normal' value, small VLs with high LMUL cause
> > disproportionately high latency that is not amortized by the speedup we
> > get from large VL with high LMUL."
> >
> > That's still pretty shaky and we'd need to see if people consider this
> > "benchmark hacking" or not.
> >
> > Anyway, IMHO you want something like:
> >   if (LOOP_VINFO_FULLY_WITH_LENGTH_P (...) && vl_dependent_lmul_scaling
> >       && niter...)
> >     lmul_factor = scale_lmul (...);
>
> As a follow-up from today's meeting. Questions to be resolved still:
>  - Experiment with the above condition and check if it works. If so,
>    let's discuss again next week.
>  - Does LTO help with figuring out the runtime unknown length here? If
>    not, we might need a bug report.
>  - How does --param=vect-partial-vector-usage=0 perform for loops you're
>    interested in? I don't think Richi would like it too much but if we had
>    the ability to determine "partial vectorization yes/no" per mode, rather
>    than per target, we could go for e.g. a SIMD-style main loop and a
>    partially vectorized epilogue (that would be limited to LMUL1). That
>    would be closest to vector loop versioning and likely better than cost
>    scaling.
>
> --
> Regards
> Robin
>
