On Fri, 15 May 2026 09:52:20 GMT, Ferenc Rakoczi <[email protected]> wrote:
>> An aarch64 implementation of the MontgomeryIntegerPolynomial256.mult() >> method and IntegerPolynomial.conditionalAssign(). Since 64-bit >> multiplication is not supported on Neon and manually performing this >> operation with 32-bit limbs is slower than with GPRs, a hybrid neon/gpr >> approach is used. Neon instructions are used to compute intermediate values >> used in the last two iterations of the main "loop", while the GPRs compute >> the first few iterations. At the method level this improves performance by >> ~9% and at the API level roughly 5%. >> >> >> >> --------- >> - [x] I confirm that I make this contribution in accordance with the >> [OpenJDK Interim AI Policy](https://openjdk.org/legal/ai). > > Ferenc Rakoczi has updated the pull request incrementally with one additional > commit since the last revision: > > Accepting more suggestions from Andrew Dinn. A few drive-by comments. I have not dug into the core of the implementation. How would you test this for correctness? src/hotspot/cpu/aarch64/register_aarch64.hpp line 546: > 544: template<int N> > 545: VSeq<N-1> vs_tail(const VSeq<N>& v) { > 546: static_assert(N > 1, "tail sequence length must be greater than 2"); Which one is it, `N > 1`, or `greater than 2`? src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 8018: > 8016: __ ldr(c_i, c_ptr); > 8017: > 8018: // Limb 1 The numbering is a bit off here. There are duplicate `// Limb 2` comments. Should this one be `// Limb 0`? src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 8183: > 8181: tmp2 = *common_regs++, > 8182: tmp3 = *common_regs++, > 8183: tmp4 = *common_regs++; Cannot quite see where these `tmp*`-s are used? src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 8551: > 8549: > 8550: Label default_loop; > 8551: __ BIND(default_loop); How many iterations does this loop do when `length` is `0`? The `sub` below would underflow to `-1`, right? Is it possible to get here with `length == 0`? src/hotspot/cpu/aarch64/vm_version_aarch64.cpp line 462: > 460: if (FLAG_IS_DEFAULT(UseIntPolyIntrinsics)) { > 461: UseIntPolyIntrinsics = true; > 462: } Indenting is a bit off here. Also, should this depend on `CPU_ASIMD`, like ChaCha20 intrinsic enablement a few blocks above? ------------- PR Review: https://git.openjdk.org/jdk/pull/30941#pullrequestreview-4326274500 PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3272151694 PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3272147603 PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3272187350 PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3272136785 PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3272122856
