Re: RFR: 8355216: Accelerate P-256 arithmetic on aarch64 [v5]

Ferenc Rakoczi Wed, 17 Jun 2026 03:13:55 -0700

On Mon, 15 Jun 2026 12:44:43 GMT, Andrew Dinn <[email protected]> wrote:


>> I still have a problem with this.
>> 
>> With your definition of `vs_st1_interleaved` we end up with this weird 
>> organization of the multiplication products where we have two low pairs and 
>> then their corresponding high pairs repeated x 4. There is nothing in the 
>> code that explains this layout other than the comment here at the point of 
>> write (n.b. my 'over-engineered' suggestion was actually a way of making it 
>> explicit what was going on here).
>> 
>> That results in some rather unobvious indexes for the loads we see later 
>> which consume these x-products:
>> 
>> 
>>     __ ldr(low_1, Address(sp));
>>     __ ldr(high_1, Address(sp, 2 * BytesPerLong));
>>     __ ldr(low, Address(sp, BytesPerLong));
>>     __ ldr(high, Address(sp, 3 * BytesPerLong));
>>     . . .
>>     __ ldr(low_1, Address(sp, 4 * BytesPerLong));
>>     __ ldr(high_1, Address(sp, 6 * BytesPerLong));
>>     . . .
>>     __ ldr(low, Address(sp, 5 * BytesPerLong));
>>     __ ldr(high, Address(sp, 7 * BytesPerLong));
>>     . . .
>> 
>> 
>> Any maintainer looking at these `ldr` instructions (including me in about 6 
>> months time) will end up having to search back through the preceding code to 
>> connect the offsets with the comment at the point of call to 
>> `vs_st1_interleaved`. We can avoid that redundant effort either with 
>> comments or clearer code. So, the choice is:
>> 
>> 1. Comment each ldr to point back to the explanation of the saved 
>> cross-product data layout provided at the call site.
>> 2. Change the write method to store the 64-bit elements interleaved in the 
>> more obvious order `(lo0, hi0, lo1, hi1, ...)`.
>> 
>> The second option seems far preferable to me.
>> 
>> We need to drop `vs_st1_interleaved` and replace it with he following 
>> overloaded variant of `vs_st2_post`:
>> 
>>   // store 2 x N-vector sequences interleaved into 2 * N quadword
>>   // memory locations via the address supplied in base using
>>   // post-increment addressing.
>>   template<int N>
>>   void vs_st2_post(const VSeq<N>& v1, const VSeq<N>& v2,
>>                    Assembler::SIMD_Arrangement T, Register base) {
>>     static_assert((N & (N - 1)) == 0, "sequence length must be even");
>>     for (int i = 0; i < N; i++) {
>>       __ st2(v1[i], v2[i], T, __ post(base, 32));
>>     }
>>   }
>> 
>> 
>> Then change the current call and comment:
>> 
>> 
>>     // Interleaves the 8 low products in A (l0, ..., l7)  and 8
>>     // high products in D (h1, ..., h7) into 8 quadwords on the
>>     // stack at mulptr: ((l0, h0), ..., (l7, h7))
>>     v2_st2_post(A, D, __ T2D, ...
>
> Addendum:
> 
> There is no need to include the static assert for even length in the 
> overloaded variant of `vs_st2_post` i.e. you can just use this:
> 
> 
>   // store 2 x N-vector sequences interleaved into 2 * N quadword
>   // memory locations via the address supplied in base using
>   // post-increment addressing.
>   template<int N>
>   void vs_st2_post(const VSeq<N>& v1, const VSeq<N>& v2,
>                    Assembler::SIMD_Arrangement T, Register base) {
>     for (int i = 0; i < N; i++) {
>       __ st2(v1[i], v2[i], T, __ post(base, 32));
>     }
>   }```

This idea sounds great, but, unfortunately, in st2(v1, v2, T, post(base, 32)) 
it is required that the register index of v2 be one more than that of v1, so I 
just added a comment at the consuming part.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3427018182

Re: RFR: 8355216: Accelerate P-256 arithmetic on aarch64 [v5]

Reply via email to