On Tue, May 19, 2026 at 5:44 PM Robin Dapp <[email protected]> wrote:
>
> > The major tension we have is that the vectorizer doesn't emit a runtime
> > iteration check for partial vectorization like RVV's.  The assumption is
> > that length masking is always cheap.

Btw, it's still my intention to "fix" this.  On x86 this issue is present for
a small number of actual scalar iterations.
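
To make "runtime iteration check" concrete, here is a hand-written,
source-level sketch of the versioning in question.  It is only an
illustration: MIN_VECTOR_ITERS, use_vector_path and add_arrays are made-up
names, the threshold is arbitrary, and the vectorizer would emit such a
check itself rather than expect it in the source.

```c
#include <stddef.h>

/* Hypothetical threshold: below this trip count we assume the scalar
   loop is cheaper than a partially filled vector iteration.  */
enum { MIN_VECTOR_ITERS = 8 };

/* Return nonzero if the vector path would be taken for N iterations.  */
static int
use_vector_path (size_t n)
{
  return n >= MIN_VECTOR_ITERS;
}

void
add_arrays (int *restrict dst, const int *restrict a,
            const int *restrict b, size_t n)
{
  if (!use_vector_path (n))
    {
      /* Short trip count: stay scalar and avoid paying the latency of
         a mostly masked-off vector operation.  */
      for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
      return;
    }

  /* Long trip count: the loop we would want vectorized, with length
     control handling the tail.  */
  for (size_t i = 0; i < n; i++)
    dst[i] = a[i] + b[i];
}
```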

> >  That assumption does not hold when the
> > latency is just dependent on the vector size (=LMUL).  Therefore, another
> > approach could be to either disable partial vectorization altogether (you
> > can check how --param=vect-partial-vector-usage=0 works for you)
> > if vl_dependent_lmul_scaling == true (very likely too big a hammer) or
> > define a new channel to let the vectorizer know we _do_ want a runtime
> > check despite partial vectorization under specific circumstances.
> >
> > Generally, a heuristic that might be reasonable could be "If the loop is
> > length controlled (with compile-time unknown niters) and latency depends on
> > LMUL rather than VL, try to be less aggressive on LMUL.".
> >
> > The rationale would be something like: "Assuming a standard distribution of
> > VL around a 'normal' value, small VLs with high LMUL cause disproportionally
> > high latency that is not amortized by the speedup we get from large VL with
> > high LMUL."
> >
> > That's still pretty shaky and we'd need to see if people consider this
> > "benchmark hacking" or not.
> >
> > Anyway, IMHO you want something like:
> >  if (LOOP_VINFO_FULLY_WITH_LENGTH_P (...) && vl_dependent_lmul_scaling
> >      && niter...)
> >   lmul_factor = scale_lmul (...);
>
> As a follow up from today's meeting.  Questions to be resolved still:
>  - Experiment with the above condition and check if it works.  If so, let's
>    discuss again next week.
>  - Does LTO help with figuring out the runtime unknown length here?  If not,
>    we might need a bug report.
>  - How does --param=vect-partial-vector-usage=0 perform for loops you're
>    interested in?  I don't think Richi would like it too much but if we had
>    the ability to determine "partial vectorization yes/no" per mode, rather
>    than per target, we could go for e.g. a SIMD-style main loop and a
>    partially vectorized epilogue (that would be limited to LMUL1).  That
>    would be closest to vector loop versioning and likely better than cost
>    scaling.
>
> --
> Regards
>  Robin
>
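
The rationale quoted above (small VL with high LMUL is not amortized) can
be put into numbers with a toy model.  This is a rough sketch under stated
assumptions, not the RISC-V backend's cost model: ELEMS_PER_LMUL1,
LOOP_OVERHEAD, loop_cycles and toy_scale_lmul are hypothetical, and the
model simply charges LMUL cycles per vector iteration regardless of VL.

```c
/* Assumption: VLEN = 128 and 32-bit elements, so 4 elements at LMUL1.  */
enum { ELEMS_PER_LMUL1 = 4 };

/* Assumption: fixed per-iteration loop overhead in cycles.  */
enum { LOOP_OVERHEAD = 2 };

/* Cycles to process N elements when each vector iteration costs LMUL
   cycles independent of the active VL.  */
static unsigned
loop_cycles (unsigned n, unsigned lmul)
{
  unsigned vf = ELEMS_PER_LMUL1 * lmul;
  unsigned iters = (n + vf - 1) / vf;
  return iters * (lmul + LOOP_OVERHEAD);
}

/* Toy version of the proposed heuristic: be less aggressive on LMUL
   when the loop is length-controlled, niters is unknown at compile
   time, and latency depends on LMUL rather than VL.  */
static unsigned
toy_scale_lmul (unsigned lmul, int length_controlled, int niters_known,
                int latency_scales_with_lmul)
{
  if (length_controlled && !niters_known && latency_scales_with_lmul
      && lmul > 1)
    return lmul / 2;
  return lmul;
}
```

In this model 3 elements cost 10 cycles at LMUL8 but 3 cycles at LMUL1,
while 64 elements cost 20 cycles at LMUL8 versus 48 at LMUL1, so which
LMUL wins depends entirely on the trip counts seen at runtime.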
