skachkov-sc wrote:

> I think you can probably make this independent of #140721 by first just 
> supporting cases where the compressed store does not alias any of the other 
> memory accesses?

Yes, the changes in LAA are fully independent, we can skip them for now.

> Curious if you already have any runtime performance numbers you could share?

We've benchmarked the following loop pattern:
```cpp
// benchmark() is run 32 times

template <typename T>
void benchmark(T *dst, const T *src) {
  size_t idx = 0;
  for (size_t i = 0; i < 1024; ++i) {
    T cur = src[i];
    if (cur != static_cast<T>(0))
      dst[idx++] = cur;
  }
  dst[idx] = static_cast<T>(0);
}
```
On a SpacemiT X60 core (a RISC-V CPU with VLEN=256), the results are as follows:

| Type    | Cycles (scalar) | Cycles (vector) | Speedup |
|---------|-----------------|-----------------|---------|
| int16_t | 189151          | 56795           | 3.33x   |
| int32_t | 205712          | 87196           | 2.36x   |
| int64_t | 205757          | 150115          | 1.37x   |

There were no branch mispredictions for the `if (cur != static_cast<T>(0))` 
branch in the scalar case here (due to the specifics of the data in the src 
array), so I think the speedup could be even bigger for more random inputs. 
We haven't observed any significant changes on SPEC, though.

https://github.com/llvm/llvm-project/pull/140723
_______________________________________________
llvm-branch-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
