On Thu, 2 Feb 2023 06:38:58 GMT, Jatin Bhateja <[email protected]> wrote:
>>> Thanks for the review @jatin-bhateja !
>>>
>>> > Have you tried introducing just case 3 (first) and then case 2 in
>>> > existing indexInRange implementation.
>>>
>>> Yes, I tried with this way as well, together with the performance testing
>>> compared with current version. The result is:
>>>
>>> 1. For cases where the array size is aligned with the vector size, the
>>> performance is almost identical.
>>> 2. For cases where the array size is not aligned (which need the masked
>>> operation), the current version improves performance by about 5%~6%
>>> compared with the way you suggested. But the improvement is limited to
>>> the cases with smaller vector sizes, whose tail loop size is bigger.
>>>
>>> So, adding the new intrinsic has no downside for me, except for the more
>>> complex Java-side code and one additional intrinsic.
>>
>> Another benefit is that the code change will be simpler: almost all the
>> files except `AbstractMask.java` do not need to be changed. I will make
>> the modification and rerun the benchmarks to decide whether it's necessary
>> to add the new intrinsic. Thanks!
>
> While you talked about Java side changes, I found another opportunity for
> optimization in the checkIndex0 implementation. In the following code snippet
> from the checkIndex0 method, indexLimit is guaranteed to be a positive value.
>
>
>     int indexLimit = Math.max(0, Math.min(length - offset, vlength));
>     VectorMask<E> badMask =
>         iota.compare(GE, iota.broadcast(indexLimit));
>
> For float/double iota vectors, subsequent broadcast operation accepting long
> argument checks for precision loss before broadcasting floating point value.
>
>
>     long longToElementBits(long value) {
>         // Do the conversion, and then test it for failure.
>         float e = (float) value;
>         if ((long) e != value) {
>             throw badElementBits(value, e);
>         }
>         return toBits(e);
>     }
>
>
> This can be avoided by casting indexLimit to a floating-point value before
> calling broadcast; effectively, we may need to move checkIndex0 into the
> template-generated abstract vector classes.
Hi, I made a modified version that uses the old implementation for the tail loop
instead of adding the new intrinsic. The code looks like:
    public VectorMask<E> indexInRange(int offset, int limit) {
        int vlength = length();
        if (offset >= 0 && limit - offset >= vlength) {
            return this;
        } else if (offset >= limit) {
            return vectorSpecies().maskAll(false);
        }
        Vector<E> iota = vectorSpecies().zero().addIndex(1);
        VectorMask<E> badMask = checkIndex0(offset, limit, iota, vlength);
        return this.andNot(badMask);
    }
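For reference, the intended semantics of indexInRange can be modeled with plain scalar code (a sketch under my reading of the spec; indexInRangeModel is a hypothetical helper, not part of the Vector API): lane i is set iff offset + i falls in [0, limit).

```java
// Hypothetical scalar model of VectorMask.indexInRange(offset, limit):
// lane i of the result is set iff 0 <= offset + i < limit.
static boolean[] indexInRangeModel(int offset, int limit, int vlength) {
    boolean[] mask = new boolean[vlength];
    for (int i = 0; i < vlength; i++) {
        long idx = (long) offset + i; // widen to avoid int overflow
        mask[i] = idx >= 0 && idx < limit;
    }
    return mask;
}
```

In a tail loop with offset >= 0, this reduces to offset + i < limit, which is what the checkIndex0 path computes via indexLimit.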
I also tested the performance of the newly added benchmarks with different vector
sizes on NEON/SVE and x86 AVX2/AVX-512 architectures. The results show that the
modified version performs no better than the current version when the array size
is not aligned with the vector size, especially for the double/long types with
larger sizes.
Here are some raw data with NEON:

Benchmark                                 (size)  Mode  Cnt   current  modified  Units
IndexInRangeBenchmark.byteIndexInRange       131  thrpt    5  2654.919  2584.423  ops/ms
IndexInRangeBenchmark.byteIndexInRange       259  thrpt    5  1830.364  1802.876  ops/ms
IndexInRangeBenchmark.byteIndexInRange       515  thrpt    5  1058.548  1073.742  ops/ms
IndexInRangeBenchmark.doubleIndexInRange     131  thrpt    5   594.920   593.832  ops/ms
IndexInRangeBenchmark.doubleIndexInRange     259  thrpt    5   308.678   149.279  ops/ms
IndexInRangeBenchmark.doubleIndexInRange     515  thrpt    5   160.639    74.579  ops/ms
IndexInRangeBenchmark.floatIndexInRange      131  thrpt    5  1097.567  1104.008  ops/ms
IndexInRangeBenchmark.floatIndexInRange      259  thrpt    5   617.845   606.886  ops/ms
IndexInRangeBenchmark.floatIndexInRange      515  thrpt    5   315.978   152.046  ops/ms
IndexInRangeBenchmark.intIndexInRange        131  thrpt    5  1165.279  1205.486  ops/ms
IndexInRangeBenchmark.intIndexInRange        259  thrpt    5   633.648   631.672  ops/ms
IndexInRangeBenchmark.intIndexInRange        515  thrpt    5   315.370   154.144  ops/ms
IndexInRangeBenchmark.longIndexInRange       131  thrpt    5   639.840   633.623  ops/ms
IndexInRangeBenchmark.longIndexInRange       259  thrpt    5   312.267   152.788  ops/ms
IndexInRangeBenchmark.longIndexInRange       515  thrpt    5   163.028    78.150  ops/ms
IndexInRangeBenchmark.shortIndexInRange      131  thrpt    5  1834.318  1800.318  ops/ms
IndexInRangeBenchmark.shortIndexInRange      259  thrpt    5  1105.695  1094.347  ops/ms
IndexInRangeBenchmark.shortIndexInRange      515  thrpt    5   602.442   599.827  ops/ms
And the data with SVE 256-bit vector size:

Benchmark                                 (size)  Mode  Cnt    current   modified  Units
IndexInRangeBenchmark.byteIndexInRange       131  thrpt    5  23772.370  22921.113  ops/ms
IndexInRangeBenchmark.byteIndexInRange       259  thrpt    5  18930.388  17920.910  ops/ms
IndexInRangeBenchmark.byteIndexInRange       515  thrpt    5  13528.610  13282.504  ops/ms
IndexInRangeBenchmark.doubleIndexInRange     131  thrpt    5   7850.522   7975.720  ops/ms
IndexInRangeBenchmark.doubleIndexInRange     259  thrpt    5   4281.749   4373.926  ops/ms
IndexInRangeBenchmark.doubleIndexInRange     515  thrpt    5   2160.001    604.458  ops/ms
IndexInRangeBenchmark.floatIndexInRange      131  thrpt    5  13594.943  13306.904  ops/ms
IndexInRangeBenchmark.floatIndexInRange      259  thrpt    5   8163.134   7912.343  ops/ms
IndexInRangeBenchmark.floatIndexInRange      515  thrpt    5   4335.529   4198.555  ops/ms
IndexInRangeBenchmark.intIndexInRange        131  thrpt    5  22106.880  20348.266  ops/ms
IndexInRangeBenchmark.intIndexInRange        259  thrpt    5  11711.588  10958.299  ops/ms
IndexInRangeBenchmark.intIndexInRange        515  thrpt    5   5501.034   5358.441  ops/ms
IndexInRangeBenchmark.longIndexInRange       131  thrpt    5   9832.578   9829.398  ops/ms
IndexInRangeBenchmark.longIndexInRange       259  thrpt    5   4979.326   4947.166  ops/ms
IndexInRangeBenchmark.longIndexInRange       515  thrpt    5   2269.131    614.204  ops/ms
IndexInRangeBenchmark.shortIndexInRange      131  thrpt    5  19865.866  19297.628  ops/ms
IndexInRangeBenchmark.shortIndexInRange      259  thrpt    5  14005.214  13592.407  ops/ms
IndexInRangeBenchmark.shortIndexInRange      515  thrpt    5   8766.450   8531.675  ops/ms
In conclusion, I prefer to keep the current version. WDYT?
-------------
PR: https://git.openjdk.org/jdk/pull/12064