On Mon, 26 Oct 2020 09:19:45 GMT, Dong Bo <don...@openjdk.org> wrote:
> BigInteger.shiftRightImplWorker and BigInteger.shiftLeftImplWorker are not
> intrinsified on aarch64, although they are on x86_64.
> We can implement them with the USHL (register) NEON instruction, which handles
> up to four integers at a time, whereas the C2-generated assembly processes
> just one integer at a time.
> The usage of USHL can be found at:
> https://developer.arm.com/documentation/dui0801/g/A64-SIMD-Vector-Instructions/USHL--vector-?lang=en
>
> The patch passed the jtreg tier1-3 tests on our aarch64 server.
> The tests in test/jdk/java/math/BigInteger/* were run specifically to check
> the correctness of the implementation, and they passed.
>
> We ran test/micro/org/openjdk/bench/java/math/BigIntegers.java to measure the
> performance gain on Kunpeng916 and Kunpeng920.
> The following performance improvements were seen with this implementation:
> - Intrinsification of BigInteger.shiftLeft: 25.52% (Kunpeng916), 37.56%
> (Kunpeng920)
> - Intrinsification of BigInteger.shiftRight: 46.45% (Kunpeng916), 43.32%
> (Kunpeng920)
>
> The BigIntegers.java JMH micro-benchmark results:
> Benchmark                      Mode  Cnt     Score    Error  Units
>
> # Kunpeng 916, default
> BigIntegers.testAdd            avgt   25    33.554 ±  0.224  ns/op
> BigIntegers.testHugeToString   avgt   25   575.554 ± 40.656  ns/op
> BigIntegers.testLargeToString  avgt   25   190.098 ±  0.825  ns/op
> **BigIntegers.testLeftShift    avgt   25  1495.779 ± 12.365  ns/op**
> BigIntegers.testMultiply       avgt   25  7551.707 ± 39.309  ns/op
> **BigIntegers.testRightShift   avgt   25   605.302 ±  6.710  ns/op**
> BigIntegers.testSmallToString  avgt   25   179.034 ±  0.873  ns/op
>
> # Kunpeng 916, intrinsic
> BigIntegers.testAdd            avgt   25    33.531 ±  0.222  ns/op
> BigIntegers.testHugeToString   avgt   25   578.038 ± 40.675  ns/op
> BigIntegers.testLargeToString  avgt   25   188.566 ±  0.855  ns/op
> **BigIntegers.testLeftShift    avgt   25  1191.651 ± 20.136  ns/op**
> BigIntegers.testMultiply       avgt   25  7492.711 ±  3.702  ns/op
> **BigIntegers.testRightShift   avgt   25   326.891 ±  6.033  ns/op**
> BigIntegers.testSmallToString  avgt   25   178.267 ±  1.501  ns/op
>
> # Kunpeng 920, default
> BigIntegers.testAdd            avgt   25    22.790 ±  0.167  ns/op
> BigIntegers.testHugeToString   avgt   25   432.428 ± 10.736  ns/op
> BigIntegers.testLargeToString  avgt   25   121.899 ±  3.356  ns/op
> **BigIntegers.testLeftShift    avgt   25   883.530 ± 53.714  ns/op**
> BigIntegers.testMultiply       avgt   25  5918.845 ± 94.937  ns/op
> **BigIntegers.testRightShift   avgt   25   329.762 ± 15.850  ns/op**
> BigIntegers.testSmallToString  avgt   25   117.460 ±  3.040  ns/op
>
> # Kunpeng 920, intrinsic
> BigIntegers.testAdd            avgt   25    21.791 ±  0.085  ns/op
> BigIntegers.testHugeToString   avgt   25   415.209 ± 32.170  ns/op
> BigIntegers.testLargeToString  avgt   25   124.635 ±  2.157  ns/op
> **BigIntegers.testLeftShift    avgt   25   551.710 ±  7.836  ns/op**
> BigIntegers.testMultiply       avgt   25  5869.401 ± 54.803  ns/op
> **BigIntegers.testRightShift   avgt   25   186.896 ±  6.378  ns/op**
> BigIntegers.testSmallToString  avgt   25   117.543 ±  3.036  ns/op

@theRealAph Thanks for the quick review.

I have updated the patch to handle small BigIntegers. The loop that handles
fewer than four words is now unrolled for a minor performance improvement.
I also modified ./test/micro/org/openjdk/bench/java/math/BigIntegers.java to
add performance tests for small BigIntegers. The new `maxNumbits` parameter in
the test means that the bit count of each BigInteger lies in the range
`[maxNumbits - 31, maxNumbits]`.
Incremental modification:
https://github.com/openjdk/jdk/pull/861/commits/7a5d76f51e693d441dee30b3d109d1b67b525378

According to the new tests, performance regresses by 3% to 6%, and only when
`maxNumbits == 32`.
The regression seems to be an unavoidable cost of the shared intrinsic code:
performance regresses even if we return immediately from the stub, like:

    /* marked as cbz_ret below */
    address generate_bigIntegerLeftShift() {
      __ align(CodeEntryAlignment);
      StubCodeMark mark(this, "StubRoutines", "bigIntegerLeftShiftWorker");
      address start = __ pc();

      Label Exit;
      Register numIter = c_rarg4;

      // Both paths end at the return: the stub does no shifting at all.
      __ cbz(numIter, Exit);
      __ bind(Exit);
      __ ret(lr);

      return start;
    }

The performance of `cbz_ret` is almost the same as that of the intrinsified
tests with `maxNumbits == 32`.
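
For context, here is a minimal Java sketch of a scalar, word-by-word left shift of the kind the stub is meant to replace. The class name `ShiftSketch`, the method name `shiftLeftWords`, and the sample values are assumptions made for illustration; only the `numIter` name is taken from the stub above, and this is not the JDK source. When `numIter` is 0 the loop body never runs, which is the case the `cbz` in the stub checks for.

```java
class ShiftSketch {
    /**
     * Shifts numIter big-endian 32-bit words of oldArr left by shiftCount
     * bits and stores them in newArr starting at newIdx. Each output word
     * combines one input word with the bits carried in from the next, less
     * significant word, so oldArr must hold at least numIter + 1 words.
     * shiftCount must be in 1..31: Java masks int shift distances to five
     * bits, so a distance of 32 would behave like 0.
     */
    static void shiftLeftWords(int[] newArr, int[] oldArr, int newIdx,
                               int shiftCount, int numIter) {
        int shiftCountRight = 32 - shiftCount;
        for (int i = 0; i < numIter; i++) {
            newArr[newIdx + i] = (oldArr[i] << shiftCount)
                               | (oldArr[i + 1] >>> shiftCountRight);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample magnitude: 0x1_80000000_00000000.
        int[] oldArr = {0x00000001, 0x80000000, 0};
        int[] newArr = new int[2];
        shiftLeftWords(newArr, oldArr, 0, 4, 2);
        // Prints "18 0": the top two words of the value shifted left by 4.
        System.out.println(Integer.toHexString(newArr[0]) + " "
                + Integer.toHexString(newArr[1]));
    }
}
```

Each iteration of this loop produces one 32-bit word, whereas a vector implementation based on USHL can produce up to four words per instruction, as described in the quoted message.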
Similar tests that return immediately from the stub with `__ ret(0)` also
regress on an x86_64 platform.

The BigIntegers.java JMH micro-benchmark results for small BigIntegers (up to
256 bits):

    Benchmark                       (maxNumbits)  Mode  Cnt   Score   Error  Units

    # Kunpeng 916, intrinsic
    BigIntegers.testSmallLeftShift            32  avgt   25  51.444 ± 0.256  ns/op  (cbz_ret)
    BigIntegers.testSmallLeftShift            32  avgt   25  51.168 ± 0.235  ns/op
    BigIntegers.testSmallLeftShift            64  avgt   25  53.566 ± 0.694  ns/op
    BigIntegers.testSmallLeftShift            96  avgt   25  53.398 ± 0.651  ns/op
    BigIntegers.testSmallLeftShift           128  avgt   25  55.949 ± 0.977  ns/op
    BigIntegers.testSmallLeftShift           160  avgt   25  55.617 ± 0.568  ns/op
    BigIntegers.testSmallLeftShift           192  avgt   25  56.285 ± 0.959  ns/op
    BigIntegers.testSmallLeftShift           224  avgt   25  58.201 ± 0.965  ns/op
    BigIntegers.testSmallLeftShift           256  avgt   25  58.655 ± 0.953  ns/op
    BigIntegers.testSmallRightShift           32  avgt   25  56.210 ± 0.708  ns/op  (cbz_ret)
    BigIntegers.testSmallRightShift           32  avgt   25  56.072 ± 0.712  ns/op
    BigIntegers.testSmallRightShift           64  avgt   25  56.891 ± 0.458  ns/op
    BigIntegers.testSmallRightShift           96  avgt   25  56.257 ± 0.185  ns/op
    BigIntegers.testSmallRightShift          128  avgt   25  56.970 ± 0.458  ns/op
    BigIntegers.testSmallRightShift          160  avgt   25  58.041 ± 0.344  ns/op
    BigIntegers.testSmallRightShift          192  avgt   25  58.740 ± 0.405  ns/op
    BigIntegers.testSmallRightShift          224  avgt   25  60.550 ± 0.382  ns/op
    BigIntegers.testSmallRightShift          256  avgt   25  65.617 ± 0.266  ns/op

    # Kunpeng 916, default
    BigIntegers.testSmallLeftShift            32  avgt   25  49.350 ± 0.944  ns/op
    BigIntegers.testSmallLeftShift            64  avgt   25  56.810 ± 0.930  ns/op
    BigIntegers.testSmallLeftShift            96  avgt   25  59.472 ± 0.270  ns/op
    BigIntegers.testSmallLeftShift           128  avgt   25  61.208 ± 0.252  ns/op
    BigIntegers.testSmallLeftShift           160  avgt   25  63.339 ± 0.328  ns/op
    BigIntegers.testSmallLeftShift           192  avgt   25  66.456 ± 0.418  ns/op
    BigIntegers.testSmallLeftShift           224  avgt   25  68.437 ± 0.294  ns/op
    BigIntegers.testSmallLeftShift           256  avgt   25  70.301 ± 0.306  ns/op
    BigIntegers.testSmallRightShift           32  avgt   25  53.289 ± 0.272  ns/op
    BigIntegers.testSmallRightShift           64  avgt   25  65.618 ± 4.097  ns/op
    BigIntegers.testSmallRightShift           96  avgt   25  70.805 ± 3.695  ns/op
    BigIntegers.testSmallRightShift          128  avgt   25  70.862 ± 4.205  ns/op
    BigIntegers.testSmallRightShift          160  avgt   25  79.921 ± 3.272  ns/op
    BigIntegers.testSmallRightShift          192  avgt   25  75.168 ± 0.224  ns/op
    BigIntegers.testSmallRightShift          224  avgt   25  79.779 ± 0.609  ns/op
    BigIntegers.testSmallRightShift          256  avgt   25  84.364 ± 0.540  ns/op

    # Kunpeng 920, intrinsic
    BigIntegers.testSmallLeftShift            32  avgt   25  31.404 ± 0.984  ns/op  (cbz_ret)
    BigIntegers.testSmallLeftShift            32  avgt   25  31.272 ± 0.558  ns/op
    BigIntegers.testSmallLeftShift            64  avgt   25  33.558 ± 1.354  ns/op
    BigIntegers.testSmallLeftShift            96  avgt   25  34.731 ± 1.238  ns/op
    BigIntegers.testSmallLeftShift           128  avgt   25  36.082 ± 1.196  ns/op
    BigIntegers.testSmallLeftShift           160  avgt   25  36.155 ± 0.932  ns/op
    BigIntegers.testSmallLeftShift           192  avgt   25  38.442 ± 0.743  ns/op
    BigIntegers.testSmallLeftShift           224  avgt   25  38.404 ± 1.108  ns/op
    BigIntegers.testSmallLeftShift           256  avgt   25  39.381 ± 1.140  ns/op
    BigIntegers.testSmallRightShift           32  avgt   25  30.821 ± 0.533  ns/op  (cbz_ret)
    BigIntegers.testSmallRightShift           32  avgt   25  30.662 ± 1.625  ns/op
    BigIntegers.testSmallRightShift           64  avgt   25  32.686 ± 1.000  ns/op
    BigIntegers.testSmallRightShift           96  avgt   25  33.922 ± 1.068  ns/op
    BigIntegers.testSmallRightShift          128  avgt   25  34.997 ± 1.155  ns/op
    BigIntegers.testSmallRightShift          160  avgt   25  35.763 ± 1.159  ns/op
    BigIntegers.testSmallRightShift          192  avgt   25  38.180 ± 0.735  ns/op
    BigIntegers.testSmallRightShift          224  avgt   25  37.985 ± 1.619  ns/op
    BigIntegers.testSmallRightShift          256  avgt   25  39.957 ± 0.820  ns/op

    # Kunpeng 920, default
    BigIntegers.testSmallLeftShift            32  avgt   25  29.524 ± 0.861  ns/op
    BigIntegers.testSmallLeftShift            64  avgt   25  35.917 ± 0.467  ns/op
    BigIntegers.testSmallLeftShift            96  avgt   25  36.915 ± 0.317  ns/op
    BigIntegers.testSmallLeftShift           128  avgt   25  39.709 ± 0.858  ns/op
    BigIntegers.testSmallLeftShift           160  avgt   25  42.796 ± 0.824  ns/op
    BigIntegers.testSmallLeftShift           192  avgt   25  43.612 ± 0.319  ns/op
    BigIntegers.testSmallLeftShift           224  avgt   25  45.971 ± 0.336  ns/op
    BigIntegers.testSmallLeftShift           256  avgt   25  48.399 ± 0.405  ns/op
    BigIntegers.testSmallRightShift           32  avgt   25  29.122 ± 0.870  ns/op
    BigIntegers.testSmallRightShift           64  avgt   25  35.404 ± 1.236  ns/op
    BigIntegers.testSmallRightShift           96  avgt   25  37.899 ± 1.478  ns/op
    BigIntegers.testSmallRightShift          128  avgt   25  39.570 ± 0.564  ns/op
    BigIntegers.testSmallRightShift          160  avgt   25  44.768 ± 1.423  ns/op
    BigIntegers.testSmallRightShift          192  avgt   25  44.777 ± 1.433  ns/op
    BigIntegers.testSmallRightShift          224  avgt   25  49.085 ± 0.465  ns/op
    BigIntegers.testSmallRightShift          256  avgt   25  48.871 ± 1.086  ns/op

-------------

PR: https://git.openjdk.java.net/jdk/pull/861