Currently, the masked vector load, when a given index falls outside the array 
boundary, is implemented with pure Java scalar code to avoid the IOOBE 
(IndexOutOfBoundsException). This is necessary for architectures that do not 
support the predicate feature, because on them the masked load is implemented 
as a full vector load followed by a vector blend, and the full vector load 
would definitely trigger the IOOBE, which is not valid. However, for 
architectures that do support the predicate feature, like SVE/AVX-512/RVV, it 
can be vectorized with the predicated load instruction as long as the indexes 
of the lanes set in the mask are within the bounds of the array. For these 
architectures, loading the unset (inactive) lanes does not raise an exception.
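
For context, here is a small, self-contained example (public Vector API only; 
the class and method names are just for illustration) of the loop-tail pattern 
that hits this out-of-bounds case: in the last iteration, offset + 
SPECIES.length() can exceed a.length, so before this patch the load stays on 
the scalar path.

  import jdk.incubator.vector.ByteVector;
  import jdk.incubator.vector.VectorMask;
  import jdk.incubator.vector.VectorOperators;
  import jdk.incubator.vector.VectorSpecies;

  public class MaskedTailSum {
      static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_PREFERRED;

      // Sums a byte array, letting the masked load handle the partial tail.
      static long sum(byte[] a) {
          long s = 0;
          for (int i = 0; i < a.length; i += SPECIES.length()) {
              // Mask off the lanes whose indexes would fall outside a.length.
              VectorMask<Byte> m = SPECIES.indexInRange(i, a.length);
              ByteVector v = ByteVector.fromArray(SPECIES, a, i, m);
              s += v.reduceLanesToLong(VectorOperators.ADD, m);
          }
          return s;
      }
  }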

This patch adds vectorization support for the masked load in the IOOBE case. 
The original Java implementation (note the "FIXME: optimize" comment) is:


  @ForceInline
  public static
  ByteVector fromArray(VectorSpecies<Byte> species,
                       byte[] a, int offset,
                       VectorMask<Byte> m) {
      ByteSpecies vsp = (ByteSpecies) species;
      if (offset >= 0 && offset <= (a.length - species.length())) {
          return vsp.dummyVector().fromArray0(a, offset, m);
      }

      // FIXME: optimize
      checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
      return vsp.vOp(m, i -> a[offset + i]);
  }

Since this case can only be vectorized with a predicated load, HotSpot must 
check whether the current backend supports it and fall back to the Java scalar 
version if not. This is different from the normal masked vector load, for 
which the compiler generates a full vector load and a vector blend if the 
predicated load is not supported. So, to make the compiler take the expected 
action, an additional flag (i.e. `usePred`) is added to the existing 
"loadMasked" intrinsic, with the value "true" for the IOOBE case and "false" 
for the normal load. The compiler will then fail to intrinsify when the flag 
is "true" and the predicated load is not supported by the backend, which means 
the normal Java path will be executed.
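
For illustration only, here is a rough sketch of how the IOOBE branch might 
thread this flag down to the intrinsic; the extra parameter on fromArray0 and 
the comments are assumptions for readability, not the exact shape of the patch:

  @ForceInline
  public static
  ByteVector fromArray(VectorSpecies<Byte> species,
                       byte[] a, int offset,
                       VectorMask<Byte> m) {
      ByteSpecies vsp = (ByteSpecies) species;
      if (offset >= 0 && offset <= (a.length - species.length())) {
          // In-bounds case: a full load + blend is legal, so the
          // predicated load is not required (usePred == false).
          return vsp.dummyVector().fromArray0(a, offset, m, /* usePred */ false);
      }

      // Out-of-bounds case: only a predicated load is legal, so request
      // it (usePred == true). If the backend has no predicate support,
      // C2 refuses to intrinsify and the Java fallback inside fromArray0
      // (equivalent to vsp.vOp(m, i -> a[offset + i])) runs instead.
      checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
      return vsp.dummyVector().fromArray0(a, offset, m, /* usePred */ true);
  }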

Also, the same vectorization support is added for the masked variants of 
(a short usage sketch follows the list):
 - fromByteArray/fromByteBuffer
 - fromBooleanArray
 - fromCharArray

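As a small usage illustration for one of these variants (public Vector API 
only; the class name is hypothetical), a masked char-array copy whose last 
iteration takes the same out-of-bounds path through fromCharArray:

  import jdk.incubator.vector.CharVector;
  import jdk.incubator.vector.VectorMask;
  import jdk.incubator.vector.VectorSpecies;

  public class CharTailCopy {
      static final VectorSpecies<Character> SPECIES = CharVector.SPECIES_PREFERRED;

      // Copies a char array with masked loads/stores; the final iteration
      // goes through the masked fromCharArray path listed above.
      static char[] copy(char[] src) {
          char[] dst = new char[src.length];
          for (int i = 0; i < src.length; i += SPECIES.length()) {
              VectorMask<Character> m = SPECIES.indexInRange(i, src.length);
              CharVector v = CharVector.fromCharArray(SPECIES, src, i, m);
              v.intoCharArray(dst, i, m);
          }
          return dst;
      }
  }
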
The performance of the newly added benchmarks improves by about `1.88x ~ 30.26x` 
on an x86 AVX-512 system:

Benchmark                                          Before   After  Units
LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE    233.816 7075.923 ops/ms
LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms

A similar performance gain can also be observed on a 512-bit SVE system.

-------------

Commit messages:
 - 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature

Changes: https://git.openjdk.java.net/jdk/pull/8035/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8035&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8283667
  Stats: 821 lines in 43 files changed: 314 ins; 117 del; 390 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8035.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8035/head:pull/8035

PR: https://git.openjdk.java.net/jdk/pull/8035
