On Thu, 5 May 2022 02:09:39 GMT, Xiaohong Gong <xg...@openjdk.org> wrote:
>> Currently the vectorization of masked vector store is implemented by the >> masked store instruction only on architectures that support the predicate >> feature. The compiler will fall back to the java scalar code for >> non-predicate supported architectures like ARM NEON. However, for these >> systems, the masked store can be vectorized with the non-masked vector >> `"load + blend + store"`. For example, storing a vector` "v"` controlled by >> a mask` "m"` into a memory with address` "addr" (i.e. "store(addr, v, m)")` >> can be implemented with: >> >> >> 1) mem_v = load(addr) ; non-masked load from the same memory >> 2) v = blend(mem_v, v, m) ; blend with the src vector with the mask >> 3) store(addr, v) ; non-masked store into the memory >> >> >> Since the first full loading needs the array offset must be inside of the >> valid array bounds, we make the compiler do the vectorization only when the >> offset is in range of the array boundary. And the compiler will still fall >> back to the java scalar code if not all offsets are valid. Besides, the >> original offset check for masked lanes are only applied when the offset is >> not always inside of the array range. This also improves the performance for >> masked store when the offset is always valid. The whole process is similar >> to the masked load API. >> >> Here is the performance data for the masked vector store benchmarks on a X86 >> non avx-512 system, which improves about `20x ~ 50x`: >> >> Benchmark before after Units >> StoreMaskedBenchmark.byteStoreArrayMask 221.733 11094.126 ops/ms >> StoreMaskedBenchmark.doubleStoreArrayMask 41.086 1034.408 ops/ms >> StoreMaskedBenchmark.floatStoreArrayMask 73.820 1985.015 ops/ms >> StoreMaskedBenchmark.intStoreArrayMask 75.028 2027.557 ops/ms >> StoreMaskedBenchmark.longStoreArrayMask 40.929 1032.928 ops/ms >> StoreMaskedBenchmark.shortStoreArrayMask 135.794 5307.567 ops/ms >> >> Similar performance gain can also be observed on ARM NEON system. >> >> And here is the performance data on X86 avx-512 system, which improves about >> `1.88x - 2.81x`: >> >> Benchmark before after Units >> StoreMaskedBenchmark.byteStoreArrayMask 11185.956 21012.824 ops/ms >> StoreMaskedBenchmark.doubleStoreArrayMask 1480.644 3911.720 ops/ms >> StoreMaskedBenchmark.floatStoreArrayMask 2738.352 7708.365 ops/ms >> StoreMaskedBenchmark.intStoreArrayMask 4191.904 9300.428 ops/ms >> StoreMaskedBenchmark.longStoreArrayMask 2025.031 4604.504 ops/ms >> StoreMaskedBenchmark.shortStoreArrayMask 8339.389 17817.128 ops/ms >> >> Similar performance gain can also be observed on ARM SVE system. > > Xiaohong Gong has updated the pull request with a new target base due to a > merge or a rebase. The pull request now contains one commit: > > 8284050: [vectorapi] Optimize masked store for non-predicated architectures > _Mailing list message from [Hans Boehm](mailto:hbo...@google.com) on > [hotspot-dev](mailto:hotspot-...@mail.openjdk.java.net):_ > > Naive question: What happens if one of the vector elements that should not > have been updated is concurrently being written by another thread? Aren't you > generating writes to vector elements that should not have been written? > > Hans > > On Wed, May 4, 2022 at 7:08 PM Xiaohong Gong <xgong at openjdk.java.net> > wrote: Yeah, this is the similar concern with what @rose00 mentioned above. The current solution cannot work well for multi-thread progresses. I will consider other better solutions. Thanks for the comments! src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java line 3483: > 3481: ByteSpecies vsp = vspecies(); > 3482: if (offset >= 0 && offset <= (a.length - vsp.length())) { > 3483: intoBooleanArray0(a, offset, m, /* offsetInRange */ > true); The offset check could save the `checkMaskFromIndexSize` for cases that offset are in the valid array bounds, which also improves the performance. @rose00 , do you think this part of change is ok at least? Thanks! ------------- PR: https://git.openjdk.java.net/jdk/pull/8544