On Thu, 2 Feb 2023 18:07:59 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:
>> Hi, I modified a version by using the old implementation for the tail loop >> instead of adding the new intrinsics. The code looks like: >> >> public VectorMask<E> indexInRange(int offset, int limit) { >> int vlength = length(); >> if (offset >= 0 && limit - offset >= length()) { >> return this; >> } else if (offset >= limit) { >> return vectorSpecies().maskAll(false); >> } >> >> Vector<E> iota = vectorSpecies().zero().addIndex(1); >> VectorMask<E> badMask = checkIndex0(offset, limit, iota, vlength); >> return this.andNot(badMask); >> } >> >> And I tested the performance of the new added benchmarks with different >> vector size on NEON/SVE and x86 avx2/avx512 architectures. The results show >> that the performance of changed version is not better than the current >> version, if the array size is not aligned with the vector size, especially >> for the double/long type with larger size. >> >> Here are some raw data with NEON: >> >> Benchmark (size) Mode Cnt current >> modified Units >> IndexInRangeBenchmark.byteIndexInRange 131 thrpt 5 2654.919 >> 2584.423 ops/ms >> IndexInRangeBenchmark.byteIndexInRange 259 thrpt 5 1830.364 >> 1802.876 ops/ms >> IndexInRangeBenchmark.byteIndexInRange 515 thrpt 5 1058.548 >> 1073.742 ops/ms >> IndexInRangeBenchmark.doubleIndexInRange 131 thrpt 5 594.920 >> 593.832 ops/ms >> IndexInRangeBenchmark.doubleIndexInRange 259 thrpt 5 308.678 >> 149.279 ops/ms >> IndexInRangeBenchmark.doubleIndexInRange 515 thrpt 5 160.639 >> 74.579 ops/ms >> IndexInRangeBenchmark.floatIndexInRange 131 thrpt 5 1097.567 >> 1104.008 ops/ms >> IndexInRangeBenchmark.floatIndexInRange 259 thrpt 5 617.845 >> 606.886 ops/ms >> IndexInRangeBenchmark.floatIndexInRange 515 thrpt 5 315.978 >> 152.046 ops/ms >> IndexInRangeBenchmark.intIndexInRange 131 thrpt 5 1165.279 >> 1205.486 ops/ms >> IndexInRangeBenchmark.intIndexInRange 259 thrpt 5 633.648 >> 631.672 ops/ms >> IndexInRangeBenchmark.intIndexInRange 515 thrpt 5 315.370 >> 154.144 ops/ms >> IndexInRangeBenchmark.longIndexInRange 131 thrpt 5 639.840 >> 633.623 ops/ms >> IndexInRangeBenchmark.longIndexInRange 259 thrpt 5 312.267 >> 152.788 ops/ms >> IndexInRangeBenchmark.longIndexInRange 515 thrpt 5 163.028 >> 78.150 ops/ms >> IndexInRangeBenchmark.shortIndexInRange 131 thrpt 5 1834.318 >> 1800.318 ops/ms >> IndexInRangeBenchmark.shortIndexInRange 259 thrpt 5 1105.695 >> 1094.347 ops/ms >> IndexInRangeBenchmark.shortIndexInRange 515 thrpt 5 602.442 >> 599.827 ops/ms >> >> >> And the data with SVE 256-bit vector size: >> >> Benchmark (size) Mode Cnt current >> modified Units >> IndexInRangeBenchmark.byteIndexInRange 131 thrpt 5 23772.370 >> 22921.113 ops/ms >> IndexInRangeBenchmark.byteIndexInRange 259 thrpt 5 18930.388 >> 17920.910 ops/ms >> IndexInRangeBenchmark.byteIndexInRange 515 thrpt 5 13528.610 >> 13282.504 ops/ms >> IndexInRangeBenchmark.doubleIndexInRange 131 thrpt 5 7850.522 >> 7975.720 ops/ms >> IndexInRangeBenchmark.doubleIndexInRange 259 thrpt 5 4281.749 >> 4373.926 ops/ms >> IndexInRangeBenchmark.doubleIndexInRange 515 thrpt 5 2160.001 >> 604.458 ops/ms >> IndexInRangeBenchmark.floatIndexInRange 131 thrpt 5 13594.943 >> 13306.904 ops/ms >> IndexInRangeBenchmark.floatIndexInRange 259 thrpt 5 8163.134 >> 7912.343 ops/ms >> IndexInRangeBenchmark.floatIndexInRange 515 thrpt 5 4335.529 >> 4198.555 ops/ms >> IndexInRangeBenchmark.intIndexInRange 131 thrpt 5 22106.880 >> 20348.266 ops/ms >> IndexInRangeBenchmark.intIndexInRange 259 thrpt 5 11711.588 >> 10958.299 ops/ms >> IndexInRangeBenchmark.intIndexInRange 515 thrpt 5 5501.034 >> 5358.441 ops/ms >> IndexInRangeBenchmark.longIndexInRange 131 thrpt 5 9832.578 >> 9829.398 ops/ms >> IndexInRangeBenchmark.longIndexInRange 259 thrpt 5 4979.326 >> 4947.166 ops/ms >> IndexInRangeBenchmark.longIndexInRange 515 thrpt 5 2269.131 >> 614.204 ops/ms >> IndexInRangeBenchmark.shortIndexInRange 131 thrpt 5 19865.866 >> 19297.628 ops/ms >> IndexInRangeBenchmark.shortIndexInRange 259 thrpt 5 14005.214 >> 13592.407 ops/ms >> IndexInRangeBenchmark.shortIndexInRange 515 thrpt 5 8766.450 >> 8531.675 ops/ms >> >> As a conclusion, I prefer to keep the current version. WDYT? > > Yes, I agree, I did collect performance data with your benchmarks at various > AVX levels and see significant gains. Cause of performance variation b/w > integer and floating point cases is due to below limitation. > https://github.com/openjdk/jdk/pull/12064#discussion_r1094101761 > Which can be addressed in a separate PR. > > FTR here are performance numbers at UseAVX=3 > > > Benchmark (size) Mode Cnt Score > Error Units > IndexInRangeBenchmark.byteIndexInRange 1024 thrpt 2 74983.406 > ops/ms > IndexInRangeBenchmark.byteIndexInRange 1027 thrpt 2 19156.962 > ops/ms > IndexInRangeBenchmark.doubleIndexInRange 1024 thrpt 2 11368.179 > ops/ms > IndexInRangeBenchmark.doubleIndexInRange 1027 thrpt 2 2165.207 > ops/ms > IndexInRangeBenchmark.floatIndexInRange 1024 thrpt 2 18736.787 > ops/ms > IndexInRangeBenchmark.floatIndexInRange 1027 thrpt 2 3798.996 > ops/ms > IndexInRangeBenchmark.intIndexInRange 1024 thrpt 2 18797.863 > ops/ms > IndexInRangeBenchmark.intIndexInRange 1027 thrpt 2 5455.317 > ops/ms > IndexInRangeBenchmark.longIndexInRange 1024 thrpt 2 11866.493 > ops/ms > IndexInRangeBenchmark.longIndexInRange 1027 thrpt 2 2227.896 > ops/ms > IndexInRangeBenchmark.shortIndexInRange 1024 thrpt 2 46921.520 > ops/ms > IndexInRangeBenchmark.shortIndexInRange 1027 thrpt 2 8532.394 > ops/ms Thanks for the performance testing on x86 systems! Agree that a separate PR is fine, and I will address it once this PR merged. ------------- PR: https://git.openjdk.org/jdk/pull/12064