This patch optimizes vector rotate (immediate) on aarch64 with the shift-and-insert instructions, i.e. SLI and SRI.
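As a rough illustration of why shift-and-insert helps, the sketch below models a single 32-bit lane in scalar Java. A rotate is normally two shifts combined with an OR; with SRI or SLI the second shift and the combine collapse into one instruction. The class and method names (`ShiftInsertRotate`, `sri`, `sli`) are hypothetical and exist only for this example, the model is restricted to shift amounts 1..31, and the exact instruction sequence the patch emits may differ from this pairing.

```java
// Illustrative scalar model of one 32-bit NEON lane; not code from the patch.
final class ShiftInsertRotate {

    // Models SRI Vd, Vn, #shift for 1 <= shift <= 31: shift src right and
    // insert it into dst, leaving the top `shift` bits of dst unchanged.
    static int sri(int dst, int src, int shift) {
        int inserted = -1 >>> shift;          // low (32 - shift) bits are overwritten
        return (dst & ~inserted) | (src >>> shift);
    }

    // Models SLI Vd, Vn, #shift for 1 <= shift <= 31: shift src left and
    // insert it into dst, leaving the low `shift` bits of dst unchanged.
    static int sli(int dst, int src, int shift) {
        int inserted = -1 << shift;           // high (32 - shift) bits are overwritten
        return (dst & ~inserted) | (src << shift);
    }

    // Rotate left by n (1..31): one plain shift (SHL) followed by one SRI.
    static int rotateLeft(int x, int n) {
        int shifted = x << n;
        return sri(shifted, x, 32 - n);
    }

    // Rotate right by n (1..31): one plain shift (USHR) followed by one SLI.
    static int rotateRight(int x, int n) {
        int shifted = x >>> n;
        return sli(shifted, x, 32 - n);
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 0x12345678, 0x80000001};
        for (int x : samples) {
            for (int n = 1; n < 32; n++) {
                if (rotateLeft(x, n) != Integer.rotateLeft(x, n)
                        || rotateRight(x, n) != Integer.rotateRight(x, n)) {
                    throw new AssertionError("mismatch for x=" + x + ", n=" + n);
                }
            }
        }
        System.out.println("all rotations match");
    }
}
```

The check against `Integer.rotateLeft` and `Integer.rotateRight` only confirms that the two-step form computes the same value; it says nothing about performance, which depends on the micro-architecture as the numbers below show.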
The patch passed jtreg tier1-3 tests with a linux-aarch64-server-fastdebug build. The tests under `test/hotspot/jtreg/compiler/c2/cr6340864/` were also run specifically to check correctness, and they passed.

The JMH micro-benchmark `test/micro/org/openjdk/bench/java/lang/RotateBenchmark.java` was used for performance testing. It showed a ~15.4% performance improvement on Kunpeng920 (CPU tsv110), but a ~15.8% regression on Kunpeng916 (CPU A72). A switch, `UseSIMDShiftInsertForRotation`, is therefore introduced on aarch64; it defaults to `false` and is set to `true` for Kunpeng920.

The `RotateBenchmark.java` JMH micro-benchmark results on Kunpeng920:

    Benchmark                            (SHIFT)  (TESTSIZE)   Mode  Cnt     Score    Error   Units
    # kunpeng 920, -XX:-UseSIMDShiftInsertForRotation
    RotateBenchmark.testRotateLeftI           20        1024  thrpt   10  3524.840 ±  2.365  ops/ms
    RotateBenchmark.testRotateLeftIImm        20        1024  thrpt   10  3961.288 ±  0.897  ops/ms
    RotateBenchmark.testRotateLeftL           20        1024  thrpt   10  1704.321 ± 11.309  ops/ms
    RotateBenchmark.testRotateLeftLImm        20        1024  thrpt   10  2137.924 ±  2.215  ops/ms
    RotateBenchmark.testRotateRightI          20        1024  thrpt   10  3536.960 ±  7.945  ops/ms
    RotateBenchmark.testRotateRightIImm       20        1024  thrpt   10  3961.552 ±  0.673  ops/ms
    RotateBenchmark.testRotateRightL          20        1024  thrpt   10  1729.868 ±  0.468  ops/ms
    RotateBenchmark.testRotateRightLImm       20        1024  thrpt   10  2132.458 ±  3.385  ops/ms
    # kunpeng 920, default, -XX:+UseSIMDShiftInsertForRotation
    RotateBenchmark.testRotateLeftI           20        1024  thrpt   10  3504.602 ± 21.609  ops/ms
    RotateBenchmark.testRotateLeftIImm        20        1024  thrpt   10  4569.820 ±  7.455  ops/ms
    RotateBenchmark.testRotateLeftL           20        1024  thrpt   10  1730.735 ±  0.701  ops/ms
    RotateBenchmark.testRotateLeftLImm        20        1024  thrpt   10  2469.796 ±  0.981  ops/ms
    RotateBenchmark.testRotateRightI          20        1024  thrpt   10  3540.899 ±  7.679  ops/ms
    RotateBenchmark.testRotateRightIImm       20        1024  thrpt   10  4571.876 ±  0.879  ops/ms
    RotateBenchmark.testRotateRightL          20        1024  thrpt   10  1731.499 ±  0.877  ops/ms
    RotateBenchmark.testRotateRightLImm       20        1024  thrpt   10  2469.454 ±  0.705  ops/ms

This also moves all logical and shifting NEON instructions from `aarch64.ad` to `aarch64_neon.ad`, and makes two minor improvements: support for vector length 4 in `vsraa8B_imm` and `vsrla8B_imm`, and for vector length 2 in `vsraa4S_imm` and `vsrla4S_imm`.

-------------

Commit messages:
 - 8256820: AArch64: Optimize vector rotate (immediate) with shift and insert instructions

Changes: https://git.openjdk.java.net/jdk/pull/1761/files
Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=1761&range=00
Issue: https://bugs.openjdk.java.net/browse/JDK-8256820
Stats: 2899 lines in 9 files changed: 1561 ins; 1014 del; 324 mod
Patch: https://git.openjdk.java.net/jdk/pull/1761.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/1761/head:pull/1761

PR: https://git.openjdk.java.net/jdk/pull/1761