On Thu, 5 May 2022 02:00:04 GMT, Xiaohong Gong <xg...@openjdk.org> wrote:

> Currently, a masked vector store is vectorized with the masked store 
> instruction only on architectures that support predication. On 
> architectures without predicate support, such as ARM NEON, the compiler 
> falls back to the Java scalar code. However, on these systems the masked 
> store can be vectorized with the non-masked vector sequence 
> `load + blend + store`. For example, storing a vector `v` controlled by a 
> mask `m` into memory at address `addr` (i.e. `store(addr, v, m)`) can be 
> implemented with:
>  1) mem_v = load(addr)     ; non-masked load from the same memory
>  2) v = blend(mem_v, v, m) ; blend it with the src vector under the mask
>  3) store(addr, v)         ; non-masked store into the memory
> Because the initial full load requires the whole vector to lie inside the 
> valid array bounds, the compiler applies this vectorization only when the 
> offset is within the array bounds, and still falls back to the Java scalar 
> code if not all offsets are valid. In addition, the original offset check 
> for masked lanes is now applied only when the offset is not always inside 
> the array range, which also improves masked store performance when the 
> offset is always valid. The whole process is similar to the masked load 
> API.
> Here is the performance data for the masked vector store benchmarks on an 
> x86 system without AVX-512, where throughput improves by about `20x ~ 50x`:
> Benchmark                                  before      after Units
> StoreMaskedBenchmark.byteStoreArrayMask   221.733  11094.126 ops/ms
> StoreMaskedBenchmark.doubleStoreArrayMask  41.086   1034.408 ops/ms
> StoreMaskedBenchmark.floatStoreArrayMask   73.820   1985.015 ops/ms
> StoreMaskedBenchmark.intStoreArrayMask     75.028   2027.557 ops/ms
> StoreMaskedBenchmark.longStoreArrayMask    40.929   1032.928 ops/ms
> StoreMaskedBenchmark.shortStoreArrayMask  135.794   5307.567 ops/ms
> A similar performance gain can also be observed on ARM NEON systems.
> Here is the performance data on an x86 AVX-512 system, where throughput 
> improves by about `1.88x - 2.81x`:
> Benchmark                                    before     after Units
> StoreMaskedBenchmark.byteStoreArrayMask   11185.956 21012.824 ops/ms
> StoreMaskedBenchmark.doubleStoreArrayMask  1480.644  3911.720 ops/ms
> StoreMaskedBenchmark.floatStoreArrayMask   2738.352  7708.365 ops/ms
> StoreMaskedBenchmark.intStoreArrayMask     4191.904  9300.428 ops/ms
> StoreMaskedBenchmark.longStoreArrayMask    2025.031  4604.504 ops/ms
> StoreMaskedBenchmark.shortStoreArrayMask   8339.389 17817.128 ops/ms
> A similar performance gain can also be observed on ARM SVE systems.
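
As a rough illustration of the quoted `load + blend + store` sequence, the following plain-Java sketch models each vector as an `int[]` of lanes. It is not the HotSpot intrinsic and does not use the Vector API; the class and method names (`MaskedStoreSketch`, `maskedStoreBlend`, `maskedStoreScalar`) are hypothetical, and the in-range test is only an assumption about how the described condition might be expressed.

```java
import java.util.Arrays;

public class MaskedStoreSketch {

    // Reference semantics: write only the lanes whose mask bit is set.
    // Only set lanes touch the array, so unset lanes may lie out of bounds.
    static void maskedStoreScalar(int[] a, int offset, int[] v, boolean[] m) {
        for (int i = 0; i < v.length; i++) {
            if (m[i]) {
                a[offset + i] = v[i];
            }
        }
    }

    // Models "load + blend + store". The full-width load is only legal when
    // every lane is inside the array, so anything else falls back to scalar.
    static void maskedStoreBlend(int[] a, int offset, int[] v, boolean[] m) {
        int lanes = v.length;
        if (offset < 0 || offset > a.length - lanes) {
            maskedStoreScalar(a, offset, v, m);   // not all offsets are valid
            return;
        }
        // 1) mem_v = load(addr): non-masked load from the same memory
        int[] memV = Arrays.copyOfRange(a, offset, offset + lanes);
        // 2) v = blend(mem_v, v, m): take v where the mask is set, else mem_v
        int[] blended = new int[lanes];
        for (int i = 0; i < lanes; i++) {
            blended[i] = m[i] ? v[i] : memV[i];
        }
        // 3) store(addr, v): non-masked store into the memory
        System.arraycopy(blended, 0, a, offset, lanes);
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5, 6};
        int[] v = {10, 20, 30, 40};
        boolean[] m = {true, false, true, false};
        maskedStoreBlend(a, 1, v, m);
        System.out.println(Arrays.toString(a));
    }
}
```

In this model the blend path and the scalar path leave the array in the same state whenever the whole vector is in range, which is the property the quoted description relies on.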

This pull request has been closed without being integrated.


PR: https://git.openjdk.java.net/jdk/pull/8544
