Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]
On Thu, 5 May 2022 02:14:08 GMT, Xiaohong Gong wrote: >> src/hotspot/share/opto/vectorIntrinsics.cpp line 1232: >> >>> 1230: // out when current case uses the predicate feature. >>> 1231: if (!supports_predicate) { >>> 1232: bool use_predicate = false; >> >> If we rename this to needs_predicate it will be easier to understand. > > Thanks for the comment! This local variable will be removed after adding the > similar intrinsify for store masked. Please help to see the PR > https://github.com/openjdk/jdk/pull/8544. Thanks so much! Renamed to "needs_predicate". Thanks! - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]
On Thu, 31 Mar 2022 02:15:26 GMT, Quan Anh Mai wrote: >> I'm afraid not. "Load + Blend" makes the elements of unmasked lanes to be >> `0`. Then a full store may change the values in the unmasked memory to be 0, >> which is different with the mask store API definition. > > The blend should be with the intended-to-store vector, so that masked lanes > contain the need-to-store elements and unmasked lanes contain the loaded > elements, which would be stored back, which results in unchanged values. Hi @merykitty @jatin-bhateja , could you please help to take a review at the similar store masked PR https://github.com/openjdk/jdk/pull/8544 ? Any feedback is welcome! Thanks so much! - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]
On Thu, 28 Apr 2022 00:13:49 GMT, Sandhya Viswanathan wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Rename the "usePred" to "offsetInRange" > > src/hotspot/share/opto/vectorIntrinsics.cpp line 1232: > >> 1230: // out when current case uses the predicate feature. >> 1231: if (!supports_predicate) { >> 1232: bool use_predicate = false; > > If we rename this to needs_predicate it will be easier to understand. Thanks for the comment! This local variable will be removed after adding the similar intrinsify for store masked. Please help to see the PR https://github.com/openjdk/jdk/pull/8544. Thanks so much! - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]
On Thu, 5 May 2022 01:42:48 GMT, Xiaohong Gong wrote: > > > Yeah, I agree that it's not good by adding a branch checking for > > > `offsetInRange`. But actually I met the constant issue that passing the > > > values all the way cannot guarantee the argument a constant in compiler > > > at the compile time. Do you have any better idea to fixing this? > > > > > > That's odd, `boolean` constants are passed that are then converted to `int` > > constants. Did you try passing integer constants all the way through? > > I will try again. I remember the main cause is the calling of `fromArray0` > from `fromArray`, it is not annotated with `ForceInline`. The arguments might > not be compiled to a constant for cases that the offset is not in the array > range like tail loop. I tried to pass the integer constant all the way, and unfortunate that the `offsetInRange` is not compiled to a constant. The following assertion in the `vectorIntrinsics.cpp` will fail: --- a/src/hotspot/share/opto/vectorIntrinsics.cpp +++ b/src/hotspot/share/opto/vectorIntrinsics.cpp @@ -1236,6 +1236,7 @@ bool LibraryCallKit::inline_vector_mem_masked_operation(bool is_store) { } else { // Masked vector load with IOOBE always uses the predicated load. const TypeInt* offset_in_range = gvn().type(argument(8))->isa_int(); + assert(offset_in_range->is_con(), "not a constant"); if (!offset_in_range->is_con()) { if (C->print_intrinsics()) { tty->print_cr(" ** missing constant: offsetInRange=%s", - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]
On Thu, 5 May 2022 01:21:40 GMT, Paul Sandoz wrote: > > Yeah, I agree that it's not good by adding a branch checking for > > `offsetInRange`. But actually I met the constant issue that passing the > > values all the way cannot guarantee the argument a constant in compiler at > > the compile time. Do you have any better idea to fixing this? > > That's odd, `boolean` constants are passed that are then converted to `int` > constants. Did you try passing integer constants all the way through? I will try again. I remember the main cause is the calling of `fromArray0` from `fromArray`, it is not annotated with `ForceInline`. The arguments might not be compiled to a constant for cases that the offset is not in the array range like tail loop. - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]
On Thu, 5 May 2022 01:13:23 GMT, Xiaohong Gong wrote: > Yeah, I agree that it's not good by adding a branch checking for > `offsetInRange`. But actually I met the constant issue that passing the > values all the way cannot guarantee the argument a constant in compiler at > the compile time. Do you have any better idea to fixing this? That's odd, `boolean` constants are passed that are then converted to `int` constants. Did you try passing integer constants all the way through? - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]
On Fri, 29 Apr 2022 21:34:13 GMT, Paul Sandoz wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Rename the "usePred" to "offsetInRange" > > IIUC when the hardware does not support predicated loads then any false > `offsetIntRange` value causes the load intrinsic to fail resulting in the > fallback, so it would not be materially any different to the current > behavior, just more uniformly implemented. > > Why can't the intrinsic support the passing a boolean directly? Is it > something to do with constants? If that is not possible I recommend creating > named constant values and pass those all the way through rather than > converting a boolean to an integer value. Then there is no need for a branch > checking `offsetInRange`. > > Might be better to hold off until the JEP is integrated and then update, > since this will conflict (`byte[]` and `ByteBuffer` load methods are removed > and `MemorySegment` load methods are added). You could prepare for that now > by branching off `vectorIntrinsics`. Thanks for your comments @PaulSandoz ! > IIUC when the hardware does not support predicated loads then any false > `offsetIntRange` value causes the load intrinsic to fail resulting in the > fallback, so it would not be materially any different to the current > behavior, just more uniformly implemented. Yes, it's true that this patch doesn't have any different to the hardware that does not support the predicated loads. It only benefits the predicated feature supported systems like ARM SVE and X86 AVX-512. > Why can't the intrinsic support the passing a boolean directly? Is it > something to do with constants? If that is not possible I recommend creating > named constant values and pass those all the way through rather than > converting a boolean to an integer value. Then there is no need for a branch > checking offsetInRange. Yeah, I agree that it's not good by adding a branch checking for `offsetInRange`. But actually I met the constant issue that passing the values all the way cannot guarantee the argument a constant in compiler at the compile time. Do you have any better idea to fixing this? > Might be better to hold off until the JEP is integrated and then update, > since this will conflict (byte[] and ByteBuffer load methods are removed and > MemorySegment load methods are added). You could prepare for that now by > branching off vectorIntrinsics. Agree. Thanks! - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]
On Fri, 22 Apr 2022 07:08:24 GMT, Xiaohong Gong wrote: >> Currently the vector load with mask when the given index happens out of the >> array boundary is implemented with pure java scalar code to avoid the IOOBE >> (IndexOutOfBoundaryException). This is necessary for architectures that do >> not support the predicate feature. Because the masked load is implemented >> with a full vector load and a vector blend applied on it. And a full vector >> load will definitely cause the IOOBE which is not valid. However, for >> architectures that support the predicate feature like SVE/AVX-512/RVV, it >> can be vectorized with the predicated load instruction as long as the >> indexes of the masked lanes are within the bounds of the array. For these >> architectures, loading with unmasked lanes does not raise exception. >> >> This patch adds the vectorization support for the masked load with IOOBE >> part. Please see the original java implementation (FIXME: optimize): >> >> >> @ForceInline >> public static >> ByteVector fromArray(VectorSpecies species, >>byte[] a, int offset, >>VectorMask m) { >> ByteSpecies vsp = (ByteSpecies) species; >> if (offset >= 0 && offset <= (a.length - species.length())) { >> return vsp.dummyVector().fromArray0(a, offset, m); >> } >> >> // FIXME: optimize >> checkMaskFromIndexSize(offset, vsp, m, 1, a.length); >> return vsp.vOp(m, i -> a[offset + i]); >> } >> >> Since it can only be vectorized with the predicate load, the hotspot must >> check whether the current backend supports it and falls back to the java >> scalar version if not. This is different from the normal masked vector load >> that the compiler will generate a full vector load and a vector blend if the >> predicate load is not supported. So to let the compiler make the expected >> action, an additional flag (i.e. `usePred`) is added to the existing >> "loadMasked" intrinsic, with the value "true" for the IOOBE part while >> "false" for the normal load. And the compiler will fail to intrinsify if the >> flag is "true" and the predicate load is not supported by the backend, which >> means that normal java path will be executed. >> >> Also adds the same vectorization support for masked: >> - fromByteArray/fromByteBuffer >> - fromBooleanArray >> - fromCharArray >> >> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` >> on the x86 AVX-512 system: >> >> Benchmark before After Units >> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 737.542 1387.069 ops/ms >> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366 330.776 ops/ms >> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 233.832 6125.026 ops/ms >> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms >> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 119.771 330.587 ops/ms >> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 431.961 939.301 ops/ms >> >> Similar performance gain can also be observed on 512-bit SVE system. > > Xiaohong Gong has updated the pull request incrementally with one additional > commit since the last revision: > > Rename the "usePred" to "offsetInRange" IIUC when the hardware does not support predicated loads then any false `offsetIntRange` value causes the load intrinsic to fail resulting in the fallback, so it would not be materially any different to the current behavior, just more uniformly implemented. Why can't the intrinsic support the passing a boolean directly? Is it something to do with constants? If that is not possible I recommend creating named constant values and pass those all the way through rather than converting a boolean to an integer value. Then there is no need for a branch checking `offsetInRange`. Might be better to hold off until the JEP is integrated and then update, since this will conflict (`byte[]` and `ByteBuffer` load methods are removed and `MemorySegment` load methods are added). You could prepare for that now by branching off `vectorIntrinsics`. - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]
On Fri, 22 Apr 2022 07:08:24 GMT, Xiaohong Gong wrote: >> Currently the vector load with mask when the given index happens out of the >> array boundary is implemented with pure java scalar code to avoid the IOOBE >> (IndexOutOfBoundaryException). This is necessary for architectures that do >> not support the predicate feature. Because the masked load is implemented >> with a full vector load and a vector blend applied on it. And a full vector >> load will definitely cause the IOOBE which is not valid. However, for >> architectures that support the predicate feature like SVE/AVX-512/RVV, it >> can be vectorized with the predicated load instruction as long as the >> indexes of the masked lanes are within the bounds of the array. For these >> architectures, loading with unmasked lanes does not raise exception. >> >> This patch adds the vectorization support for the masked load with IOOBE >> part. Please see the original java implementation (FIXME: optimize): >> >> >> @ForceInline >> public static >> ByteVector fromArray(VectorSpecies species, >>byte[] a, int offset, >>VectorMask m) { >> ByteSpecies vsp = (ByteSpecies) species; >> if (offset >= 0 && offset <= (a.length - species.length())) { >> return vsp.dummyVector().fromArray0(a, offset, m); >> } >> >> // FIXME: optimize >> checkMaskFromIndexSize(offset, vsp, m, 1, a.length); >> return vsp.vOp(m, i -> a[offset + i]); >> } >> >> Since it can only be vectorized with the predicate load, the hotspot must >> check whether the current backend supports it and falls back to the java >> scalar version if not. This is different from the normal masked vector load >> that the compiler will generate a full vector load and a vector blend if the >> predicate load is not supported. So to let the compiler make the expected >> action, an additional flag (i.e. `usePred`) is added to the existing >> "loadMasked" intrinsic, with the value "true" for the IOOBE part while >> "false" for the normal load. And the compiler will fail to intrinsify if the >> flag is "true" and the predicate load is not supported by the backend, which >> means that normal java path will be executed. >> >> Also adds the same vectorization support for masked: >> - fromByteArray/fromByteBuffer >> - fromBooleanArray >> - fromCharArray >> >> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` >> on the x86 AVX-512 system: >> >> Benchmark before After Units >> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 737.542 1387.069 ops/ms >> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366 330.776 ops/ms >> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 233.832 6125.026 ops/ms >> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms >> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 119.771 330.587 ops/ms >> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 431.961 939.301 ops/ms >> >> Similar performance gain can also be observed on 512-bit SVE system. > > Xiaohong Gong has updated the pull request incrementally with one additional > commit since the last revision: > > Rename the "usePred" to "offsetInRange" @PaulSandoz Could you please take a look at the Java changes when you find time. This PR from @XiaohongGong is a very good step towards long standing Vector API wish list for better tail loop handling. - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]
On Wed, 20 Apr 2022 02:44:39 GMT, Xiaohong Gong wrote: >>> The blend should be with the intended-to-store vector, so that masked lanes >>> contain the need-to-store elements and unmasked lanes contain the loaded >>> elements, which would be stored back, which results in unchanged values. >> >> It may not work if memory is beyond legal accessible address space of the >> process, a corner case could be a page boundary. Thus re-composing the >> intermediated vector which partially contains actual updates but effectively >> perform full vector write to destination address may not work in all >> scenarios. > > Thanks for the comment! So how about adding the check for the valid array > range like the masked vector load? > Codes like: > > public final > void intoArray(byte[] a, int offset, >VectorMask m) { > if (m.allTrue()) { > intoArray(a, offset); > } else { > ByteSpecies vsp = vspecies(); > if (offset >= 0 && offset <= (a.length - vsp.length())) { // > a full range check > intoArray0(a, offset, m, /* usePred */ false); >// can be vectorized by load+blend_store > } else { > checkMaskFromIndexSize(offset, vsp, m, 1, a.length); > intoArray0(a, offset, m, /* usePred */ true); >// only be vectorized by the predicated store > } > } > } Thanks, this looks ok since out of range condition will not be intrinsified if targets does not support predicated vector store. - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]
On Fri, 22 Apr 2022 07:08:24 GMT, Xiaohong Gong wrote: >> Currently the vector load with mask when the given index happens out of the >> array boundary is implemented with pure java scalar code to avoid the IOOBE >> (IndexOutOfBoundaryException). This is necessary for architectures that do >> not support the predicate feature. Because the masked load is implemented >> with a full vector load and a vector blend applied on it. And a full vector >> load will definitely cause the IOOBE which is not valid. However, for >> architectures that support the predicate feature like SVE/AVX-512/RVV, it >> can be vectorized with the predicated load instruction as long as the >> indexes of the masked lanes are within the bounds of the array. For these >> architectures, loading with unmasked lanes does not raise exception. >> >> This patch adds the vectorization support for the masked load with IOOBE >> part. Please see the original java implementation (FIXME: optimize): >> >> >> @ForceInline >> public static >> ByteVector fromArray(VectorSpecies species, >>byte[] a, int offset, >>VectorMask m) { >> ByteSpecies vsp = (ByteSpecies) species; >> if (offset >= 0 && offset <= (a.length - species.length())) { >> return vsp.dummyVector().fromArray0(a, offset, m); >> } >> >> // FIXME: optimize >> checkMaskFromIndexSize(offset, vsp, m, 1, a.length); >> return vsp.vOp(m, i -> a[offset + i]); >> } >> >> Since it can only be vectorized with the predicate load, the hotspot must >> check whether the current backend supports it and falls back to the java >> scalar version if not. This is different from the normal masked vector load >> that the compiler will generate a full vector load and a vector blend if the >> predicate load is not supported. So to let the compiler make the expected >> action, an additional flag (i.e. `usePred`) is added to the existing >> "loadMasked" intrinsic, with the value "true" for the IOOBE part while >> "false" for the normal load. And the compiler will fail to intrinsify if the >> flag is "true" and the predicate load is not supported by the backend, which >> means that normal java path will be executed. >> >> Also adds the same vectorization support for masked: >> - fromByteArray/fromByteBuffer >> - fromBooleanArray >> - fromCharArray >> >> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` >> on the x86 AVX-512 system: >> >> Benchmark before After Units >> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 737.542 1387.069 ops/ms >> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366 330.776 ops/ms >> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 233.832 6125.026 ops/ms >> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms >> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 119.771 330.587 ops/ms >> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 431.961 939.301 ops/ms >> >> Similar performance gain can also be observed on 512-bit SVE system. > > Xiaohong Gong has updated the pull request incrementally with one additional > commit since the last revision: > > Rename the "usePred" to "offsetInRange" Rest of the patch looks good to me. src/hotspot/share/opto/vectorIntrinsics.cpp line 1232: > 1230: // out when current case uses the predicate feature. > 1231: if (!supports_predicate) { > 1232: bool use_predicate = false; If we rename this to needs_predicate it will be easier to understand. - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]
On Wed, 20 Apr 2022 02:46:09 GMT, Xiaohong Gong wrote: >> src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java >> line 2861: >> >>> 2859: ByteSpecies vsp = (ByteSpecies) species; >>> 2860: if (offset >= 0 && offset <= (a.length - >>> species.vectorByteSize())) { >>> 2861: return vsp.dummyVector().fromByteArray0(a, offset, m, /* >>> usePred */ false).maybeSwap(bo); >> >> Instead of usePred a term like inRange or offetInRage or offsetInVectorRange >> would be easier to follow. > > Thanks for the review. I will change it later. The name is updated to `offsetInRange`. Thanks! - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]
> Currently the vector load with mask when the given index happens out of the > array boundary is implemented with pure java scalar code to avoid the IOOBE > (IndexOutOfBoundaryException). This is necessary for architectures that do > not support the predicate feature. Because the masked load is implemented > with a full vector load and a vector blend applied on it. And a full vector > load will definitely cause the IOOBE which is not valid. However, for > architectures that support the predicate feature like SVE/AVX-512/RVV, it can > be vectorized with the predicated load instruction as long as the indexes of > the masked lanes are within the bounds of the array. For these architectures, > loading with unmasked lanes does not raise exception. > > This patch adds the vectorization support for the masked load with IOOBE > part. Please see the original java implementation (FIXME: optimize): > > > @ForceInline > public static > ByteVector fromArray(VectorSpecies species, >byte[] a, int offset, >VectorMask m) { > ByteSpecies vsp = (ByteSpecies) species; > if (offset >= 0 && offset <= (a.length - species.length())) { > return vsp.dummyVector().fromArray0(a, offset, m); > } > > // FIXME: optimize > checkMaskFromIndexSize(offset, vsp, m, 1, a.length); > return vsp.vOp(m, i -> a[offset + i]); > } > > Since it can only be vectorized with the predicate load, the hotspot must > check whether the current backend supports it and falls back to the java > scalar version if not. This is different from the normal masked vector load > that the compiler will generate a full vector load and a vector blend if the > predicate load is not supported. So to let the compiler make the expected > action, an additional flag (i.e. `usePred`) is added to the existing > "loadMasked" intrinsic, with the value "true" for the IOOBE part while > "false" for the normal load. And the compiler will fail to intrinsify if the > flag is "true" and the predicate load is not supported by the backend, which > means that normal java path will be executed. > > Also adds the same vectorization support for masked: > - fromByteArray/fromByteBuffer > - fromBooleanArray > - fromCharArray > > The performance for the new added benchmarks improve about `1.88x ~ 30.26x` > on the x86 AVX-512 system: > > Benchmark before After Units > LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 737.542 1387.069 ops/ms > LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366 330.776 ops/ms > LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 233.832 6125.026 ops/ms > LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms > LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 119.771 330.587 ops/ms > LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 431.961 939.301 ops/ms > > Similar performance gain can also be observed on 512-bit SVE system. Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: Rename the "usePred" to "offsetInRange" - Changes: - all: https://git.openjdk.java.net/jdk/pull/8035/files - new: https://git.openjdk.java.net/jdk/pull/8035/files/8f9e8a3c..9b2d2f19 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk=8035=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk=8035=00-01 Stats: 393 lines in 41 files changed: 0 ins; 0 del; 393 mod Patch: https://git.openjdk.java.net/jdk/pull/8035.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8035/head:pull/8035 PR: https://git.openjdk.java.net/jdk/pull/8035