On Fri, 15 May 2026 09:52:20 GMT, Ferenc Rakoczi <[email protected]> wrote:
>> An aarch64 implementation of the MontgomeryIntegerPolynomial256.mult() >> method and IntegerPolynomial.conditionalAssign(). Since 64-bit >> multiplication is not supported on Neon and manually performing this >> operation with 32-bit limbs is slower than with GPRs, a hybrid neon/gpr >> approach is used. Neon instructions are used to compute intermediate values >> used in the last two iterations of the main "loop", while the GPRs compute >> the first few iterations. At the method level this improves performance by >> ~9% and at the API level roughly 5%. >> >> >> >> --------- >> - [x] I confirm that I make this contribution in accordance with the >> [OpenJDK Interim AI Policy](https://openjdk.org/legal/ai). > > Ferenc Rakoczi has updated the pull request incrementally with one additional > commit since the last revision: > > Accepting more suggestions from Andrew Dinn. @ferakocz Still working my way through this. Here are a few more recommendations for comments and code cleanups. Also, I offered a correction to a comment where I misread the data layout for the Neon data written to the stack. src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7869: > 7867: __ ldr(a_i, __ post(a, 8)); > 7868: gpr_partial_mult_52(a_i, b_0, high, low, tmp, limb_mask); > 7869: __ andr(n, low, limb_mask); It would be clearer to use `mov(n, low)` here or, at the very least `orr(n, zr, low)`. `low` has already been masked with `limb_mask`, which makes the `andr` look redundant when, in fact, the point is to save the initial value computed for `low`. src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7873: > 7871: neon_partial_mult_64(B, b_highs, a_vals, 0); > 7872: > 7873: // Limb 0 cont Suggestion: // Limb 0 modulus computation // n.b. modulus computation requires multiplying successive // limbs of the product by corresponding limbs of the p256 // prime adding the result to the limb and folding this // partial result into a running 256-bit sum in c_i. Limbs // of c_i are stored via c_ptr once carries are included. // n.b. the mul + add is omitted for limb 2 since the // corresponding prime bits are zero. src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7899: > 7897: gpr_partial_mult_52(a_i, b_2, high, low, tmp, limb_mask); > 7898: __ add(c_i, c_i, low); > 7899: __ str(c_i, Address(c_ptr, 8)); I don't really like the hard-coded numbers in these `c_ptr` address expressions. At the very least it would be better if they were be specified using a const expression like `nnn * BytesPerLong` to show that we are indexing an array of 64 bit data at index `nnn`. A better idea would be to define a static helper method that also checks the index is in range #define C_DATA_COUNT 5 static int c_slot_offset(index i) { assert(0 <= i && i < C_SLOT_COUNT, "invalid index for c_i data"); return i * BytesPerLong; } You can then define the address as `Address(c_ptr, c_offset(1))` etc. Also, once you have defined `C_DATA_COUNT` you can also define a constant C_DATA_SIZE for use when you extend the stack: #define C_DATA_SIZE align_up(C_DATA_COUNT, 2) * BytesPerLong; . . . __ sub(sp, sp, C_DATA_SIZE); . . . The same consideration applies with even more force as regards the hard-wired offsets used when accessing the Neon multiplication product data via `sp`. __ ldr(low_1, Address(sp)); __ ldr(high_1, Address(sp, 16)); This also allows you to fold the details of the data layout into the function: #define MUL_DATA_COUNT 8 static int mul_data_offset(index i, boolean want_high) { assert(0 <= i && i < MUL_DATA_COUNT, "invalid index for mul data"); // data is stored as successive low/high pairs // (lo0, l1), (hi0, hi1), ... (hi2, hi3) int pair_start = (i >> 1) * 4; int high_select = (want_high ? 2 : 0) int pair_element = (i & 1); int idx = pair_start + high_select + pair_element; return (idx * BytesPerLong); } . . . __ ldr(low_1, Address(sp, mul_data_offset(0, false)); __ ldr(high_1, Address(sp, mul_data_offset(0, true))); . . . Likewise you can define a constant MUL_DATA_SIZE for use when you extend the stack: #define MUL_DATA_SIZE (MUL_DATA_COUNT * 2) * BytesPerLong; . . . __ sub(sp, sp, MUL_DATA_SIZE); . . . ------------- PR Review: https://git.openjdk.org/jdk/pull/30941#pullrequestreview-4311333998 PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3259942006 PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3260445976 PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3267260201
