Re: RFR: 8355216: Accelerate P-256 arithmetic on aarch64 [v5]

Aleksey Shipilev Wed, 20 May 2026 01:27:07 -0700

On Fri, 15 May 2026 09:52:20 GMT, Ferenc Rakoczi <[email protected]> wrote:


>> An aarch64 implementation of the MontgomeryIntegerPolynomial256.mult() 
>> method and IntegerPolynomial.conditionalAssign(). Since 64-bit 
>> multiplication is not supported on Neon and manually performing this 
>> operation with 32-bit limbs is slower than with GPRs, a hybrid neon/gpr 
>> approach is used. Neon instructions are used to compute intermediate values 
>> used in the last two iterations of the main "loop", while the GPRs compute 
>> the first few iterations. At the method level this improves performance by 
>> ~9% and at the API level roughly 5%.
>> 
>> 
>> 
>> ---------
>> - [x] I confirm that I make this contribution in accordance with the 
>> [OpenJDK Interim AI Policy](https://openjdk.org/legal/ai).
>
> Ferenc Rakoczi has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Accepting more suggestions from Andrew Dinn.

A few drive-by comments. I have not dug into the core of the implementation.

How would you test this for correctness?

src/hotspot/cpu/aarch64/register_aarch64.hpp line 546:

> 544: template<int N>
> 545: VSeq<N-1> vs_tail(const VSeq<N>& v) {
> 546:   static_assert(N > 1, "tail sequence length must be greater than 2");

Which one is it, `N > 1`, or `greater than 2`?

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 8018:

> 8016:     __ ldr(c_i, c_ptr);
> 8017: 
> 8018:     // Limb 1

The numbering is a bit off here. There are duplicate `// Limb 2` comments. 
Should this one be `// Limb 0`?

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 8183:

> 8181:       tmp2 = *common_regs++,
> 8182:       tmp3 = *common_regs++,
> 8183:       tmp4 = *common_regs++;

Cannot quite see where these `tmp*`-s are used?

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 8551:

> 8549: 
> 8550:       Label default_loop;
> 8551:       __ BIND(default_loop);

How many iterations does this loop do when `length` is `0`? The `sub` below 
would underflow to `-1`, right? Is it possible to get here with `length == 0`?

src/hotspot/cpu/aarch64/vm_version_aarch64.cpp line 462:

> 460:   if (FLAG_IS_DEFAULT(UseIntPolyIntrinsics)) {
> 461:         UseIntPolyIntrinsics = true;
> 462:   }

Indenting is a bit off here. Also, should this depend on `CPU_ASIMD`, like 
ChaCha20 intrinsic enablement a few blocks above?

-------------

PR Review: https://git.openjdk.org/jdk/pull/30941#pullrequestreview-4326274500
PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3272151694
PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3272147603
PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3272187350
PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3272136785
PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3272122856

Re: RFR: 8355216: Accelerate P-256 arithmetic on aarch64 [v5]

Reply via email to