On Thu, 5 May 2022 02:09:39 GMT, Xiaohong Gong <xg...@openjdk.org> wrote:

>> Currently the masked vector store is only vectorized, using the masked 
>> store instruction, on architectures that support the predicate feature. 
>> The compiler falls back to Java scalar code on architectures without 
>> predicate support, such as ARM NEON. However, on these systems the masked 
>> store can still be vectorized with a non-masked vector 
>> `"load + blend + store"` sequence. For example, storing a vector `"v"` 
>> controlled by a mask `"m"` to memory at address `"addr"` 
>> (i.e. `"store(addr, v, m)"`) can be implemented with:
>> 
>> 
>>  1) mem_v = load(addr)     ; non-masked load from the same memory
>>  2) v = blend(mem_v, v, m) ; blend the loaded vector with "v" under mask "m"
>>  3) store(addr, v)         ; non-masked store back to the memory
>> 
>> 
>> Since the initial full-width load requires the array offset to be inside 
>> the valid array bounds, the compiler only applies this vectorization when 
>> the offset is within the array boundary, and still falls back to Java 
>> scalar code if not all offsets are valid. Besides, the original offset 
>> check for masked lanes is now only applied when the offset is not always 
>> inside the array range, which also improves the performance of masked 
>> stores whose offsets are always valid. The whole approach is similar to 
>> the masked load API.
>> 
>> Here is the performance data for the masked vector store benchmarks on an 
>> x86 system without AVX-512, which shows an improvement of about `20x ~ 50x`:
>> 
>> Benchmark                                  before    after   Units
>> StoreMaskedBenchmark.byteStoreArrayMask   221.733  11094.126 ops/ms
>> StoreMaskedBenchmark.doubleStoreArrayMask  41.086   1034.408 ops/ms
>> StoreMaskedBenchmark.floatStoreArrayMask   73.820   1985.015 ops/ms
>> StoreMaskedBenchmark.intStoreArrayMask     75.028   2027.557 ops/ms
>> StoreMaskedBenchmark.longStoreArrayMask    40.929   1032.928 ops/ms
>> StoreMaskedBenchmark.shortStoreArrayMask  135.794   5307.567 ops/ms
>> 
>> Similar performance gain can also be observed on ARM NEON system.
>> 
>> And here is the performance data on an x86 AVX-512 system, which shows an 
>> improvement of about `1.88x ~ 2.81x`:
>> 
>> Benchmark                                  before     after   Units
>> StoreMaskedBenchmark.byteStoreArrayMask   11185.956 21012.824 ops/ms
>> StoreMaskedBenchmark.doubleStoreArrayMask  1480.644  3911.720 ops/ms
>> StoreMaskedBenchmark.floatStoreArrayMask   2738.352  7708.365 ops/ms
>> StoreMaskedBenchmark.intStoreArrayMask     4191.904  9300.428 ops/ms
>> StoreMaskedBenchmark.longStoreArrayMask    2025.031  4604.504 ops/ms
>> StoreMaskedBenchmark.shortStoreArrayMask   8339.389 17817.128 ops/ms
>> 
>> Similar performance gain can also be observed on ARM SVE system.
>
> Xiaohong Gong has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains one commit:
> 
>   8284050: [vectorapi] Optimize masked store for non-predicated architectures
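
(For context for readers on the list: the masked store being optimized here is 
the Vector API store-with-mask operation, as typically used for loop tails. 
The snippet below is only a simplified illustration, not the actual 
StoreMaskedBenchmark source.)

import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

public class MaskedStoreExample {
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_PREFERRED;

    // Copy src into dst; the mask keeps the final partial vector from
    // reading or writing past the end of the arrays.
    static void copy(byte[] src, byte[] dst) {
        for (int i = 0; i < src.length; i += SPECIES.length()) {
            VectorMask<Byte> m = SPECIES.indexInRange(i, src.length);
            ByteVector v = ByteVector.fromArray(SPECIES, src, i, m); // masked load
            v.intoArray(dst, i, m); // masked store: lanes where m is unset must not be written
        }
    }
}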

> _Mailing list message from [Hans Boehm](mailto:hbo...@google.com) on 
> [hotspot-dev](mailto:hotspot-...@mail.openjdk.java.net):_
> 
> Naive question: What happens if one of the vector elements that should not 
> have been updated is concurrently being written by another thread? Aren't you 
> generating writes to vector elements that should not have been written?
> 
> Hans
> 
> On Wed, May 4, 2022 at 7:08 PM Xiaohong Gong <xgong at openjdk.java.net> 
> wrote:

Yeah, this is a similar concern to the one @rose00 raised above. The current 
solution does not work correctly for multi-threaded programs. I will look into 
better solutions. Thanks for the comments!
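
To spell the issue out, the "load + blend + store" lowering described in the 
quoted text corresponds roughly to the following hand-written Vector API 
sequence (a simplified sketch using ByteVector, not the actual compiler IR):

import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

public class BlendStoreSketch {
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_PREFERRED;

    // Hand-written equivalent of lowering "store(addr, v, m)" to
    // "load + blend + store" on hardware without predicated stores.
    static void storeMaskedViaBlend(byte[] a, int offset, ByteVector v, VectorMask<Byte> m) {
        ByteVector mem = ByteVector.fromArray(SPECIES, a, offset); // 1) full-width load of the destination
        ByteVector merged = mem.blend(v, m);                       // 2) take v where m is set, mem elsewhere
        merged.intoArray(a, offset);                               // 3) full-width store back
        // Step 3 rewrites the masked-off lanes with the values read in step 1,
        // so a concurrent write to one of those lanes by another thread between
        // steps 1 and 3 can be lost -- the race Hans and @rose00 describe.
    }
}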

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java 
line 3483:

> 3481:             ByteSpecies vsp = vspecies();
> 3482:             if (offset >= 0 && offset <= (a.length - vsp.length())) {
> 3483:                 intoBooleanArray0(a, offset, m, /* offsetInRange */ true);

The offset check can save the `checkMaskFromIndexSize` call for cases where 
the offset is within the valid array bounds, which also improves performance. 
@rose00, do you think this part of the change is OK at least? Thanks!
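
For reference, the intended shape of the guard is roughly the sketch below. 
The fast path is the quoted hunk; the else branch and the exact 
`checkMaskFromIndexSize` call are my paraphrase of the existing masked-access 
code, not a quote of the patch:

// Inside ByteVector.intoBooleanArray(boolean[] a, int offset, VectorMask<Byte> m):
ByteSpecies vsp = vspecies();
if (offset >= 0 && offset <= (a.length - vsp.length())) {
    // Fast path: the whole vector lies inside the array, so the per-lane
    // checkMaskFromIndexSize check is skipped and the intrinsic is told
    // that the offset is in range.
    intoBooleanArray0(a, offset, m, /* offsetInRange */ true);
} else {
    // Slow path (paraphrased): some lanes may fall outside the array, so
    // verify that every lane enabled in the mask is in bounds before storing.
    checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
    intoBooleanArray0(a, offset, m, /* offsetInRange */ false);
}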

-------------

PR: https://git.openjdk.java.net/jdk/pull/8544
