On Thu, 5 May 2022 02:09:39 GMT, Xiaohong Gong <xg...@openjdk.org> wrote:

>> Currently the vectorization of masked vector store is implemented by the 
>> masked store instruction only on architectures that support the predicate 
>> feature. The compiler will fall back to the java scalar code for 
>> non-predicate supported architectures like ARM NEON. However, for these 
>> systems, the masked store can be vectorized with the non-masked vector 
>> `"load + blend + store"` sequence. For example, storing a vector `"v"` 
>> controlled by a mask `"m"` into memory at address `"addr"` (i.e. 
>> `"store(addr, v, m)"`) can be implemented with:
>> 
>> 
>>  1) mem_v = load(addr)     ; non-masked load from the same memory
>>  2) v = blend(mem_v, v, m) ; blend with the src vector with the mask
>>  3) store(addr, v)         ; non-masked store into the memory
>> 
>> 
>> Since the initial full-width load requires the array offset to be inside the 
>> valid array bounds, the compiler applies this vectorization only when the 
>> offset is within the array boundary, and still falls back to the Java scalar 
>> code when not all offsets are valid. Besides, the original offset check for 
>> masked lanes is only applied when the offset is not always inside the array 
>> range. This also improves the performance of masked stores when the offset is 
>> always valid. The whole process is similar to the masked load API.
>> 
>> Here is the performance data for the masked vector store benchmarks on an 
>> X86 non-AVX-512 system, which shows an improvement of about `20x ~ 50x`:
>> 
>> Benchmark                                  before    after   Units
>> StoreMaskedBenchmark.byteStoreArrayMask   221.733  11094.126 ops/ms
>> StoreMaskedBenchmark.doubleStoreArrayMask  41.086   1034.408 ops/ms
>> StoreMaskedBenchmark.floatStoreArrayMask   73.820   1985.015 ops/ms
>> StoreMaskedBenchmark.intStoreArrayMask     75.028   2027.557 ops/ms
>> StoreMaskedBenchmark.longStoreArrayMask    40.929   1032.928 ops/ms
>> StoreMaskedBenchmark.shortStoreArrayMask  135.794   5307.567 ops/ms
>> 
>> Similar performance gain can also be observed on ARM NEON system.
>> 
>> And here is the performance data on an X86 AVX-512 system, which shows an 
>> improvement of about `1.88x - 2.81x`:
>> 
>> Benchmark                                  before     after   Units
>> StoreMaskedBenchmark.byteStoreArrayMask   11185.956 21012.824 ops/ms
>> StoreMaskedBenchmark.doubleStoreArrayMask  1480.644  3911.720 ops/ms
>> StoreMaskedBenchmark.floatStoreArrayMask   2738.352  7708.365 ops/ms
>> StoreMaskedBenchmark.intStoreArrayMask     4191.904  9300.428 ops/ms
>> StoreMaskedBenchmark.longStoreArrayMask    2025.031  4604.504 ops/ms
>> StoreMaskedBenchmark.shortStoreArrayMask   8339.389 17817.128 ops/ms
>> 
>> Similar performance gain can also be observed on ARM SVE system.
>
> Xiaohong Gong has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains one commit:
> 
>   8284050: [vectorapi] Optimize masked store for non-predicated architectures

The JIT (in all other circumstances AFAIK) never produces "phantom stores", 
stores into Java variables which are not specified as the target of a JVM store 
instruction (putfield, dastore, etc.).  The fact that a previously-read value 
is used by the phantom store does not make it any better.

Yes, the memory states may be correct after the blend and store are done, but 
the effect under the Java Memory Model is to issue the extra phantom stores of 
the unselected array elements.  Under certain circumstances, this will create race 
conditions after the optimization where there were no race conditions before 
the optimization.  Other threads could (under Java Memory Model rules) witness 
the effects of the phantom stores.  If the Java program is properly 
synchronized, the introduction of an illegitimate race condition can cause 
another thread, now in an illegal race, to see an old value in a variable (the 
recopied unselected array element) which the JMM says is impossible.
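
To make the hazard concrete, here is a minimal scalar sketch in plain Java of the proposed lowering, with the racing write inlined sequentially for determinism (the class and all names are illustrative, not part of the patch):

```java
public class PhantomStoreRace {
    // Scalar stand-in for the proposed "load + blend + store" lowering, with
    // the problematic interleaving made explicit.
    static int[] demo() {
        int[] a = {0, 0, 0, 0};
        boolean[] m = {true, false, true, false};  // lanes 1 and 3 are unselected
        int[] v = {10, 20, 30, 40};

        // Step 1 of the lowering: full-width load, including unselected lanes.
        int[] loaded = a.clone();

        // Meanwhile, another thread (inlined here for determinism) updates an
        // UNSELECTED lane.  In a properly synchronized program the JMM says
        // this write must not be lost.
        a[1] = 99;

        // Steps 2+3: blend with the mask, then full-width store.  Lane 1 is
        // written back with the stale value 0 -- a phantom store that a true
        // masked store could never produce.
        for (int i = 0; i < a.length; i++) {
            a[i] = m[i] ? v[i] : loaded[i];
        }
        return a;
    }

    public static void main(String[] args) {
        // The other thread's write of 99 to a[1] has been lost.
        System.out.println(java.util.Arrays.toString(demo()));
    }
}
```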

Yes, this only shows up in multi-threaded programs, and ones where two threads 
step on one array, but Java is a multi-threaded language, and it must conform 
to its own specification as such.

This blend technique would be very reasonable if there is no race condition.  
(Except at the very beginning or end of arrays.)  And the JMM leaves room for 
many optimizations.  And yet I think this is a step too far.  I'd like to be 
wrong about this, but I don't think I am.

So, I think you need to use a different technique, other than 
blend-and-unpredicated-store, for masked stores on non-predicated architectures.

For example, you could simulate a masked store instruction on an architecture 
that supports scatter (scattering values of the array type).  Do this by 
setting up two vectors of machine pointers.  One vector points to each 
potentially-affected element of the array (some kind of index computation plus 
a scaled iota vector).  The other vector is set up similarly, but points into a 
fixed-sized, thread-local buffer, what I would call the "bit bucket".  Blend 
the addresses, and then scatter, so that selected array lanes are updated, and 
unselected values are sent to the "bit bucket".
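
The address-blending scatter can be sketched in scalar Java, with array indices standing in for the machine pointers and a scratch array standing in for the thread-local bit bucket (all names are hypothetical):

```java
public class ScatterMaskedStore {
    // Simulates a masked store as a scatter: each lane's destination is either
    // the real array slot or a slot in a per-thread "bit bucket" scratch
    // buffer.  Unselected array elements are never written, so no phantom
    // stores are introduced.
    static void maskedStoreViaScatter(int[] a, int offset, int[] v, boolean[] m) {
        int[] bitBucket = new int[v.length];   // stand-in for the thread-local buffer
        for (int i = 0; i < v.length; i++) {
            // "Blend the addresses": pick the destination per lane, then scatter.
            if (m[i]) {
                a[offset + i] = v[i];          // selected lane: store into the array
            } else {
                bitBucket[i] = v[i];           // unselected lane: absorbed by the bit bucket
            }
        }
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        maskedStoreViaScatter(a, 0, new int[]{10, 20, 30, 40},
                              new boolean[]{true, false, true, false});
        // Only the selected lanes 0 and 2 were written.
        System.out.println(java.util.Arrays.toString(a));
    }
}
```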

This is complex enough (and platform-dependent enough) that you probably need 
to write a hand-coded assembly language subroutine, to call from the JIT code.  
Sort of like the arraycopy stubs.

It's even more work than the proposed patch here, but it's the right thing, I'm 
afraid.

src/hotspot/share/opto/vectorIntrinsics.cpp line 1363:

> 1361:       // Use the vector blend to implement the masked store. The biased elements are the original
> 1362:       // values in the memory.
> 1363:       Node* mem_val = gvn().transform(LoadVectorNode::make(0, control(), memory(addr), addr, addr_type, mem_num_elem, mem_elem_bt));

I'm sorry to say it, but I am pretty sure this is an invalid optimization.
See top-level comment for more details.

-------------

Changes requested by jrose (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/8544
