On Mon, 15 Jun 2026 12:24:08 GMT, Andrew Dinn <[email protected]> wrote:
>> Yes. Accepted
>
> I still have a problem with this.
>
> With your definition of `vs_st1_interleaved` we end up with this weird
> organization of the multiplication products where we have two low pairs and
> then their corresponding high pairs repeated x 4. There is nothing in the
> code that explains this layout other than the comment here at the point of
> write (n.b. my 'over-engineered' suggestion was actually a way of making it
> explicit what was going on here).
>
> That results in some rather unobvious indexes for the loads we see later
> which consume these x-products:
>
>
> __ ldr(low_1, Address(sp));
> __ ldr(high_1, Address(sp, 2 * BytesPerLong));
> __ ldr(low, Address(sp, BytesPerLong));
> __ ldr(high, Address(sp, 3 * BytesPerLong));
> . . .
> __ ldr(low_1, Address(sp, 4 * BytesPerLong));
> __ ldr(high_1, Address(sp, 6 * BytesPerLong));
> . . .
> __ ldr(low, Address(sp, 5 * BytesPerLong));
> __ ldr(high, Address(sp, 7 * BytesPerLong));
> . . .
>
>
> Any maintainer looking at these `ldr` instructions (including me in about 6
> months time) will end up having to search back through the preceding code to
> connect the offsets with the comment at the point of call to
> `vs_st1_interleaved`. We can avoid that redundant effort either with comments
> or clearer code can avoid it. So, the choice is:
>
> 1. Comment each ldr to point back to the explanation of the saved
> cross-product data layout provided at the call site.
> 2. Change the write method to store the 64-bit elements interleaved in the
> more obvious order `(lo0, hi0, lo1, hi1, ...)`.
>
> The second option seems far preferable to me.
>
> We need to drop `vs_st1_interleaved` and replace it with he following
> overloaded variant of `vs_st2_post`:
>
> // store 2 x N-vector sequences interleaved into 2 * N quadword
> // memory locations via the address supplied in base using
> // post-increment addressing.
> template<int N>
> void vs_st2_post(const VSeq<N>& v1, const VSeq<N>& v2,
> Assembler::SIMD_Arrangement T, Register base) {
> static_assert((N & (N - 1)) == 0, "sequence length must be even");
> for (int i = 0; i < N; i++) {
> __ st2(v1[i], v2[i], T, __ post(base, 32));
> }
> }
>
>
> Then change the current call and comment:
>
>
> // Interleaves the 8 low products in A (l0, ..., l7) and 8
> // high products in D (h1, ..., h7) into 8 quadwords on the
> // stack at mulptr: ((l0, h0), ..., (l7, h7))
> v2_st2_post(A, D, __ T2D, mul_ptr);
Addendum:
There is no need to include the static assert for even length in the overloaded
variant of `vs_st2_post` i.e. you can just use this:
// store 2 x N-vector sequences interleaved into 2 * N quadword
// memory locations via the address supplied in base using
// post-increment addressing.
template<int N>
void vs_st2_post(const VSeq<N>& v1, const VSeq<N>& v2,
Assembler::SIMD_Arrangement T, Register base) {
for (int i = 0; i < N; i++) {
__ st2(v1[i], v2[i], T, __ post(base, 32));
}
}```
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/30941#discussion_r3413524508