Re: RFR: 8355216: Accelerate P-256 arithmetic on aarch64 [v5]

Ferenc Rakoczi Fri, 22 May 2026 06:16:35 -0700

On Mon, 18 May 2026 15:06:41 GMT, Andrew Dinn <[email protected]> wrote:


>> Ferenc Rakoczi has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   Accepting more suggestions from Andrew Dinn.
>
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7869:
> 
>> 7867:     __ ldr(a_i, __ post(a, 8));
>> 7868:     gpr_partial_mult_52(a_i, b_0, high, low, tmp, limb_mask);
>> 7869:     __ andr(n, low, limb_mask);
> 
> It would be clearer to use `mov(n, low)` here or, at the very least `orr(n, 
> zr, low)`. `low` has already been masked with `limb_mask`, which makes the 
> `andr` look redundant when, in fact, the point is to save the initial value 
> computed for `low`.

Accepted.

> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7873:
> 
>> 7871:     neon_partial_mult_64(B, b_highs, a_vals, 0);
>> 7872: 
>> 7873:     // Limb 0 cont
> 
> Suggestion:
> 
>     // Limb 0 modulus computation
>     // n.b. modulus computation requires multiplying successive
>     // limbs of the product by corresponding limbs of the p256
>     // prime adding the result to the limb and folding this
>     // partial result into a running 256-bit sum in c_i. Limbs
>     // of c_i are stored via c_ptr once carries are included.
>     // n.b. the mul + add is omitted for limb 2 since the
>     // corresponding prime bits are zero.

Accepted.

> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7899:
> 
>> 7897:     gpr_partial_mult_52(a_i, b_2, high, low, tmp, limb_mask);
>> 7898:     __ add(c_i, c_i, low);
>> 7899:     __ str(c_i, Address(c_ptr, 8));
> 
> I don't really like the hard-coded numbers in these `c_ptr` address 
> expressions. At the very least it would be better if they were be specified 
> using a const expression like `nnn * BytesPerLong` to show that we are 
> indexing an array of 64 bit data at index `nnn`.
> 
> A better idea would be to define a static helper method that also checks the 
> index is in range
> 
> 
> #define C_DATA_COUNT 5
> static int c_slot_offset(index i) {
>   assert(0 <= i && i < C_SLOT_COUNT, "invalid index for c_i data");
>   return i * BytesPerLong;
> }
> 
> 
> You can then define the address as `Address(c_ptr, c_offset(1))` etc.
> 
> Also, once you have defined `C_DATA_COUNT` you can also define a constant 
> C_DATA_SIZE for use when you extend the stack:
> 
> 
> #define C_DATA_SIZE align_up(C_DATA_COUNT, 2) * BytesPerLong;
> 
>     . . .
>     __ sub(sp, sp, C_DATA_SIZE);
>     . . .
> 
> 
> The same consideration applies with even more force as regards the hard-wired 
> offsets used when accessing the Neon multiplication product data via `sp`.
> 
> 
>     __ ldr(low_1, Address(sp));
>     __ ldr(high_1, Address(sp, 16));
> 
> 
> This also allows you to fold the details of the data layout into the function:
>  
> 
> #define MUL_DATA_COUNT 8
> static int mul_data_offset(index i, boolean want_high) {
>   assert(0 <= i && i < MUL_DATA_COUNT, "invalid index for mul data");
>   // data is stored as successive low/high pairs
>   // (l0, l1), (h0, h1), ... (l6, l7), (h6, h7)
>   int pair_start = (i >> 1) * 4;
>   int high_select = (want_high ? 2 : 0)
>   int pair_element = (i & 1);
>   int idx = pair_start + high_select + pair_element;
>   return (idx * BytesPerLong);
> }
> 
>     . . .
>     __ ldr(low_1, Address(sp, mul_data_offset(0, false));
>     __ ldr(high_1, Address(sp, mul_data_offset(0, true)));
>     . . .
> 
> 
> Likewise you can define a constant MUL_DATA_SIZE for use when you extend the 
> stack:
> 
> 
> #define MUL_DATA_SIZE (MUL_DATA_COUNT * 2) * BytesPerLong;
> 
>     . . .
>     __ sub(sp, sp, MUL_DATA_SIZE);
>     . . .

I think this would be over-engineering. We are talking about fixed length 
arrays here, k * BytesPerLong clearly enough indicates the index  k in my 
opinion (even with "precomputing" for indicies 0 and 1).

> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 8077:
> 
>> 8075:       + b_0 + b_1 + b_2;
>> 8076:         b_0 = b_1 = b_2 = noreg;
>> 8077: 
> 
> This freeing of registers would be better done at line 8002 (i.e. before 
> processing a3) and with register b_3 also freed. It highlights to the 
> maintainer that b0 - b3 are no longer needed from that point on.

Done.

> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 8267:
> 
>> 8265:     //   P521OrderField:         19 = 8 + 8 + 2 + 1
>> 8266:     // Special Cases 5, 10, 14, 16, 19
>> 8267: 
> 
> You need to insert the standard call `load_archive_data` here and return any 
> stub that is found.

Done.

> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 8377:
> 
>> 8375:       VSeq<4> b_vec(20);
>> 8376: 
>> 8377:       __ ldr(a9, Address(aLimbs, 64));
> 
> It makes things more obvious to a maintainer and avoids mistakes should 
> anything need patching if you use 8 * BytesPerLong here instead of 64 and 
> likewise for all the other hard-wired Address offsets used throughout this 
> method. That way the offsets directly relate to the vector and GPR counts 
> mentioned in the comments.

Accepted.

> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 8568:
> 
>> 8566:     __ mov(r0, zr); // return 0
>> 8567:     __ ret(lr);
>> 8568:     return start;
> 
> You need to call `store_archive_data` here.

Done.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3288540269
PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3288541160
PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3288542197
PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3288542627
PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3288556389
PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3288557945
PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3288556748

Re: RFR: 8355216: Accelerate P-256 arithmetic on aarch64 [v5]

Reply via email to