Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v4]

Xiaohong Gong Fri, 13 May 2022 02:58:03 -0700

> Currently, a masked vector load whose index range extends beyond the array 
> boundary is implemented with pure Java scalar code to avoid the IOOBE 
> (IndexOutOfBoundsException). This is necessary for architectures that do 
> not support the predicate feature, because there the masked load is 
> implemented as a full vector load with a vector blend applied to it, and a 
> full vector load would definitely cause an IOOBE, which is not valid. 
> However, on architectures that support the predicate feature, such as 
> SVE/AVX-512/RVV, the load can be vectorized with the predicated load 
> instruction as long as the indexes of the masked lanes (those set in the 
> mask) are within the bounds of the array. On these architectures, unmasked 
> lanes that fall outside the array do not raise an exception.
> 
> This patch adds vectorization support for the IOOBE part of the masked 
> load. Please see the original Java implementation (FIXME: optimize):
> 
> 
>   @ForceInline
>   public static
>   ByteVector fromArray(VectorSpecies<Byte> species,
>                        byte[] a, int offset,
>                        VectorMask<Byte> m) {
>       ByteSpecies vsp = (ByteSpecies) species;
>       if (offset >= 0 && offset <= (a.length - species.length())) {
>           return vsp.dummyVector().fromArray0(a, offset, m);
>       }
> 
>       // FIXME: optimize
>       checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>       return vsp.vOp(m, i -> a[offset + i]);
>   }
> 
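To make the semantics of the scalar fallback concrete, here is a minimal sketch in plain Java. It is not the JDK implementation: `MaskedLoadModel` and `maskedLoad` are hypothetical names, and a `boolean[]` stands in for `VectorMask<Byte>`. It only models the rule described above, namely that lanes set in the mask must be in bounds, while unset lanes never touch the array and are left as zero.

```java
import java.util.Arrays;

final class MaskedLoadModel {
    /**
     * Hypothetical scalar model (not JDK code) of a masked load of
     * mask.length byte lanes starting at a[offset].
     */
    static byte[] maskedLoad(byte[] a, int offset, boolean[] mask) {
        byte[] lanes = new byte[mask.length]; // unset lanes stay 0
        for (int i = 0; i < mask.length; i++) {
            if (!mask[i]) {
                continue; // unset lane: the array is never read for it
            }
            int index = offset + i;
            if (index < 0 || index >= a.length) {
                // Only a lane that is set in the mask can trigger the exception.
                throw new IndexOutOfBoundsException(
                        "lane " + i + " reads index " + index
                        + " of array length " + a.length);
            }
            lanes[i] = a[index];
        }
        return lanes;
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 4, 5};
        // Four lanes starting at offset 4: only lane 0 is inside the array.
        boolean[] tailMask = {true, false, false, false};
        System.out.println(Arrays.toString(maskedLoad(a, 4, tailMask)));
    }
}
```

With the tail mask shown in `main`, the load succeeds even though `offset + 3` is past the end of the array. Setting lane 1 in the mask as well would make the model throw, which is the case the scalar path has to preserve.
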
> Since this path can only be vectorized with the predicated load, HotSpot 
> must check whether the current backend supports it and fall back to the 
> Java scalar version if not. This differs from the normal masked vector 
> load, for which the compiler generates a full vector load and a vector 
> blend if the predicated load is not supported. To let the compiler take 
> the expected action, an additional flag (i.e. `usePred`) is added to the 
> existing "loadMasked" intrinsic, with the value "true" for the IOOBE part 
> and "false" for the normal load. The compiler fails to intrinsify if the 
> flag is "true" and the predicated load is not supported by the backend, 
> in which case the normal Java path is executed.
> 
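The decision described in the paragraph above can be summarized as a small truth table. The sketch below is only a model of that description, not HotSpot code: `LoadMaskedPlanModel`, `Plan`, `choose`, and the `backendHasPredicatedLoad` parameter are hypothetical names introduced for illustration, and only `usePred` comes from the quoted text.

```java
final class LoadMaskedPlanModel {
    /** Possible outcomes for a "loadMasked" call in this model. */
    enum Plan { PREDICATED_LOAD, FULL_LOAD_AND_BLEND, JAVA_FALLBACK }

    /**
     * Hypothetical model of the choice described in the text.
     *
     * @param usePred true for the IOOBE part, false for the normal masked load
     * @param backendHasPredicatedLoad assumed backend capability flag
     */
    static Plan choose(boolean usePred, boolean backendHasPredicatedLoad) {
        if (backendHasPredicatedLoad) {
            return Plan.PREDICATED_LOAD;      // vectorized with the predicated load
        }
        if (usePred) {
            return Plan.JAVA_FALLBACK;        // intrinsification fails, Java path runs
        }
        return Plan.FULL_LOAD_AND_BLEND;      // normal load: full vector load plus blend
    }

    public static void main(String[] args) {
        boolean[] values = {false, true};
        for (boolean usePred : values) {
            for (boolean supported : values) {
                System.out.println("usePred=" + usePred
                        + ", predicatedLoad=" + supported
                        + " -> " + choose(usePred, supported));
            }
        }
    }
}
```

The point of the flag in this model is the middle branch: without it, the compiler could not distinguish the case where a full vector load plus blend is safe from the case where it would read outside the array.
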
> The patch also adds the same vectorization support for the masked:
>  - fromByteArray/fromByteBuffer
>  - fromBooleanArray
>  - fromCharArray
> 
> The performance of the newly added benchmarks improves by about 
> `1.88x ~ 30.26x` on the x86 AVX-512 system:
> 
> Benchmark                                          Before   After  Units
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE    233.816 7075.923 ops/ms
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
> 
> A similar performance gain can also be observed on a 512-bit SVE system.


Xiaohong Gong has updated the pull request incrementally with one additional 
commit since the last revision:

  Use integer constant for offsetInRange all the way through

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/8035/files
  - new: https://git.openjdk.java.net/jdk/pull/8035/files/9c69206e..07edfbd5

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8035&range=03
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8035&range=02-03

  Stats: 438 lines in 39 files changed: 33 ins; 118 del; 287 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8035.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8035/head:pull/8035

PR: https://git.openjdk.java.net/jdk/pull/8035
