> The major tension we have is that the vectorizer doesn't emit a runtime 
> iteration check for partial vectorization like RVV's.  The assumption is that 
> length masking is always cheap.  That assumption does not hold when the 
> latency is just dependent on the vector size (=LMUL).  Therefore, another 
> approach could be to either disable partial vectorization altogether (you can 
> check how --param=vect-partial-vector-usage=0 works for you) if 
> vl_dependent_lmul_scaling == true (very likely too big a hammer) or define a 
> new channel to let the vectorizer know we _do_ want a runtime check despite 
> partial vectorization under specific circumstances.
>
> Generally, a heuristic that might be reasonable could be "If the loop is
> length controlled (with compile-time unknown niters) and latency depends on 
> LMUL rather than VL, try to be less aggressive on LMUL.".
>
> The rationale would be something like: "Assuming a standard distribution of
> VL around a 'normal' value, small VLs with high LMUL cause disproportionally
> high latency that is not amortized by the speedup we get from large VL with 
> high LMUL."
>
> That's still pretty shaky and we'd need to see if people consider this 
> "benchmark hacking" or not.
>
> Anyway, IMHO you want something like:
>  if (LOOP_VINFO_FULLY_WITH_LENGTH_P (...) && vl_dependent_lmul_scaling
>      && niter...)
>   lmul_factor = scale_lmul (...);
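
To make the rationale above a bit more concrete, here is a toy model.  It is
not GCC code: ELEMS_PER_REG, LOOP_OVERHEAD, loop_cycles and choose_lmul are
made-up names, the "one iteration costs LMUL cycles regardless of VL"
assumption is a simplification, and halving LMUL is just one possible reading
of "less aggressive".

```c
#include <stdbool.h>

/* Hypothetical toy model, not GCC code.  */
enum
{
  ELEMS_PER_REG = 4,  /* assumed elements per LMUL=1 register */
  LOOP_OVERHEAD = 2   /* assumed fixed cycles of loop control per iteration */
};

/* Cycles for a length-controlled loop over N elements at a given LMUL,
   assuming each vector iteration costs LMUL cycles no matter how many
   elements (VL) are actually active.  */
static unsigned
loop_cycles (unsigned n, unsigned lmul)
{
  unsigned vlmax = ELEMS_PER_REG * lmul;
  unsigned iters = (n + vlmax - 1) / vlmax;  /* ceil (n / vlmax) */
  return iters * (lmul + LOOP_OVERHEAD);
}

/* Sketch of the heuristic: for a length-controlled loop with unknown
   niters on a core whose latency tracks LMUL, back off by one LMUL step.
   The halving is an assumption made for illustration only.  */
static unsigned
choose_lmul (unsigned max_lmul, bool length_controlled, bool niters_known,
             bool latency_tracks_lmul)
{
  if (length_controlled && !niters_known && latency_tracks_lmul
      && max_lmul > 1)
    return max_lmul / 2;
  return max_lmul;
}
```

In this model a 3-element loop costs 10 cycles at LMUL8 but 3 at LMUL1, while
a 64-element loop costs 20 at LMUL8 and 48 at LMUL1.  Whether backing off pays
off therefore depends entirely on the distribution of VL, which is exactly the
shaky part.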

As a follow-up from today's meeting, these questions are still to be resolved:
 - Experiment with the above condition and check if it works.  If so, let's 
   discuss again next week.
 - Does LTO help with figuring out the runtime unknown length here?  If not, we 
   might need a bug report.
 - How does --param=vect-partial-vector-usage=0 perform for loops you're 
   interested in?  I don't think Richi would like it too much but if we had
   the ability to determine "partial vectorization yes/no" per mode, rather 
   than per target, we could go for e.g. a SIMD-style main loop and a partially 
   vectorized epilogue (that would be limited to LMUL1).  That would be closest 
   to vector loop versioning and likely better than cost scaling.
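
For illustration, the loop shape from the last point could look roughly like
the scalar emulation below.  This only shows the structure, not what the
vectorizer would emit; MAIN_VF, EPI_VLMAX and add_arrays are made-up names, and
the element counts assume an LMUL8 main loop and an LMUL1 epilogue with four
elements per register.

```c
#include <stddef.h>

/* Hypothetical sizes: full-width main loop, LMUL1 epilogue.  */
enum { MAIN_VF = 32, EPI_VLMAX = 4 };

/* Computes dst[i] = a[i] + b[i] for N elements and returns the number of
   epilogue iterations that were needed.  */
static size_t
add_arrays (int *dst, const int *a, const int *b, size_t n)
{
  size_t i = 0;

  /* SIMD-style main loop: full vectors only, no length masking.  */
  for (; i + MAIN_VF <= n; i += MAIN_VF)
    for (size_t j = 0; j < MAIN_VF; j++)
      dst[i + j] = a[i + j] + b[i + j];

  /* Partially vectorized epilogue: length-controlled, at most EPI_VLMAX
     elements per iteration.  */
  size_t epi_iters = 0;
  while (i < n)
    {
      /* Plays the role of the VL a vsetvl would return.  */
      size_t vl = n - i < EPI_VLMAX ? n - i : EPI_VLMAX;
      for (size_t j = 0; j < vl; j++)
        dst[i + j] = a[i + j] + b[i + j];
      i += vl;
      epi_iters++;
    }
  return epi_iters;
}
```

With n = 70 the main loop handles 64 elements in two full iterations and the
epilogue covers the remaining 6 in two LMUL1 iterations.  A short loop (say
n = 3) skips the main loop entirely and never pays for a high-LMUL operation.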

-- 
Regards
 Robin
