skachkov-sc wrote:
> I think you can probably make this independent of #140721 by first just
> supporting cases where the compressed store does not alias any of the other
> memory accesses?
Yes, the changes in LAA are fully independent; we can skip them for now.
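For illustration, the no-alias case that would remain supported covers loops where the destination of the compressed store is known to be disjoint from every other access, e.g. via `__restrict` (a hypothetical sketch, not code from this patch):
```cpp
#include <cstddef>

// Hypothetical no-alias example: __restrict tells the compiler that the
// compressed store to dst cannot alias the loads from src, so no runtime
// alias checks (and hence no LAA changes) are needed.
void compress_nonzero(int *__restrict dst, const int *__restrict src,
                      size_t n) {
  size_t idx = 0;
  for (size_t i = 0; i < n; ++i)
    if (src[i] != 0)
      dst[idx++] = src[i]; // monotonically advancing (compressed) store
}
```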
> Curious if you already have any runtime performance numbers you could share?
We've benchmarked the following loop pattern:
```cpp
#include <cstddef> // size_t

// benchmark() is run 32 times
template <typename T>
void benchmark(T *dst, const T *src) {
  size_t idx = 0;
  for (size_t i = 0; i < 1024; ++i) {
    T cur = src[i];
    if (cur != static_cast<T>(0))
      dst[idx++] = cur;
  }
  dst[idx] = static_cast<T>(0);
}
```
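The vectorized form is essentially a masked compress followed by a unit-stride store. Roughly, for `int32_t`, it corresponds to this hand-written RVV-intrinsics sketch (illustrative only; the code the loop vectorizer actually emits will differ in detail):
```cpp
#include <riscv_vector.h>
#include <cstddef>
#include <cstdint>

// Hand-written approximation of the vectorized loop for int32_t.
void benchmark_rvv(int32_t *dst, const int32_t *src) {
  size_t idx = 0;
  for (size_t i = 0; i < 1024;) {
    size_t vl = __riscv_vsetvl_e32m1(1024 - i);
    vint32m1_t v = __riscv_vle32_v_i32m1(src + i, vl);
    vbool32_t m = __riscv_vmsne_vx_i32m1_b32(v, 0, vl);  // cur != 0
    vint32m1_t c = __riscv_vcompress_vm_i32m1(v, m, vl); // pack active lanes
    size_t n = __riscv_vcpop_m_b32(m, vl);               // count of active lanes
    __riscv_vse32_v_i32m1(dst + idx, c, n);              // compressed store
    idx += n;
    i += vl;
  }
  dst[idx] = 0;
}
```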
On the SpacemiT-X60 core (a RISC-V CPU with VLEN=256), the results are as follows:
| Type    | Cycles (scalar) | Cycles (vector) | Speedup |
|---------|-----------------|-----------------|---------|
| int16_t | 189151          | 56795           | 3.33x   |
| int32_t | 205712          | 87196           | 2.36x   |
| int64_t | 205757          | 150115          | 1.37x   |
There were no branch mispredicts for the `if (cur != static_cast<T>(0))` branch in
the scalar case here (due to the specifics of the data in the `src` array), so I
think the speedup could be even bigger for more random inputs. We haven't observed
any significant changes on SPEC, though.
https://github.com/llvm/llvm-project/pull/140723