On Wed, 19 Nov 2025, Robin Dapp wrote:

> > Yes, it's supposed to be a start.
> >
> > Btw, after Robin now implemented "type punning" for gather/stride loads
> > to handle power-of-two size groups load/store-lanes needs similar
> > support which would esp. help ARM where there's only up to ld4(?).
> 
> From what I heard so far nobody really likes segmented loads/stores 
> (=load/store lanes) on the riscv hardware side anyway.  On our design
> we have experimented with just replacing them with strided loads in
> the backend but that's not better either.  So, yes, it's on my list but
> the impact on performance is not clear to me yet.

I think the main difference between the two is that a segmented
load/store is a contiguous access to memory with (de-)interleaving
from/to multiple registers, while a strided load/store is not
contiguous.  The former should always be better for the memory
subsystem: using 4 strided loads instead of a single segmented load
presents it with 4 memory streams (obscuring what is really a single,
nice linear memory stream).
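
To make the access-pattern difference concrete, here is a minimal
scalar model in C.  It is only a sketch: NFIELDS, VL and both function
names are made up for illustration and do not correspond to any GCC
internal or to real instruction semantics beyond the order in which
memory is touched.  Both functions fill the same "registers"; only the
recorded order of memory indices differs.

```c
#include <stddef.h>

/* Hypothetical sizes: 4 fields per segment, 4 elements per register.  */
enum { NFIELDS = 4, VL = 4 };

/* Model of one segmented load: walk memory linearly exactly once and
   de-interleave element I of field F into regs[F][I].  ORDER records
   the sequence of memory indices touched.  */
static void
segmented_load (const int *mem, int regs[NFIELDS][VL],
                size_t order[NFIELDS * VL])
{
  size_t n = 0;
  for (size_t i = 0; i < VL; i++)
    for (size_t f = 0; f < NFIELDS; f++)
      {
        size_t idx = i * NFIELDS + f;
        regs[f][i] = mem[idx];
        order[n++] = idx;
      }
}

/* Model of NFIELDS strided loads: one pass per field, each stepping
   through memory with a stride of NFIELDS elements.  */
static void
strided_loads (const int *mem, int regs[NFIELDS][VL],
               size_t order[NFIELDS * VL])
{
  size_t n = 0;
  for (size_t f = 0; f < NFIELDS; f++)
    for (size_t i = 0; i < VL; i++)
      {
        size_t idx = f + i * NFIELDS;
        regs[f][i] = mem[idx];
        order[n++] = idx;
      }
}
```

In this model segmented_load touches indices 0, 1, 2, ... in order,
while strided_loads touches 0, 4, 8, 12, then 1, 5, 9, 13, and so on,
i.e. four interleaved passes over the same memory.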

So I believe that if the strided load case is faster on actual HW
that tells a lot about how "optimized" the whole thing is ...

Of course RVV allowing up to 8 segments puts quite a strain on
the required (de-)interleaving hardware, but I'd expect implementing
ld{2,3,4} "fast" and microcoding the rest is a reasonable approach
here.

Richard.

-- 
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)