On Tue, Sep 2, 2025 at 22:38 Richard Henderson <richard.hender...@linaro.org> wrote:

> On 9/1/25 23:38, Max Chou wrote:
> > +#define OPMVV_VQDOTQ(NAME, TD, T1, T2, TX1, TX2, HD, HS1, HS2)          \
> > +static void do_##NAME(void *vd, void *vs1, void *vs2, int i)            \
> > +{                                                                       \
> > +    int idx;                                                            \
> > +    T1 r1;                                                              \
> > +    T2 r2;                                                              \
> > +    TX1 *r1_buf = (TX1 *)vs1 + HD(i);                                   \
> > +    TX2 *r2_buf = (TX2 *)vs2 + HD(i);                                   \
> > +    TD acc = *((TD *)vd + HD(i));                                       \
> > +    int64_t partial_sum = 0;                                            \
>
> I think it's clear partial_sum should be the 32-bit type TD.
> Indeed, I'm not sure why you don't just have
>
>     TD acc = ((TD *)vd)[HD(i)];

Thanks for the suggestion.
I’ll update this part in version 2.

> > +                                                                        \
> > +    for (idx = 0; idx < 4; ++idx) {                                     \
> > +        r1 = *((T1 *)r1_buf + HS1(idx));                                \
> > +        r2 = *((T2 *)r2_buf + HS2(idx));                                \
> > +        partial_sum += (r1 * r2);                                       \
>
>     acc += r1 * r2;
>
> > +    }                                                                   \
> > +    *((TD *)vd + HD(i)) = (acc + partial_sum) & MAKE_64BIT_MASK(0, 32); \
>
>     ((TD *)vd)[HD(i)] = acc;
>
> because that final mask is bogus.
>

The partial_sum and the final mask were added to ensure the behavior
described in section 3 of the Zvqdotq ISA spec:

“Finally, the four products are accumulated into the corresponding element
of vd, wrapping around signed overflow.”

Thanks,
Max.
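
For reference, a small standalone sketch that compares the two formulations
discussed above: a 64-bit partial sum masked to 32 bits, and direct
accumulation into a 32-bit value. This is not the patch code. The helper
names (dot4_wide, dot4_narrow, dot4_checked) are hypothetical, and the sketch
assumes both operands are signed 8-bit, whereas T1 and T2 are macro parameters
in the patch. The narrow version does its additions in uint32_t so that
wraparound does not depend on how the compiler treats signed overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: 64-bit partial sum, then keep the low 32 bits. */
static uint32_t dot4_wide(int32_t acc, const int8_t a[4], const int8_t b[4])
{
    int64_t partial_sum = 0;

    for (int k = 0; k < 4; ++k) {
        /* Each 8-bit x 8-bit product fits comfortably in 32 bits. */
        partial_sum += (int32_t)a[k] * (int32_t)b[k];
    }
    /* The 64-bit sum cannot overflow; the mask keeps bits 0..31. */
    return (uint32_t)(((int64_t)acc + partial_sum) & 0xFFFFFFFF);
}

/* Hypothetical helper: accumulate directly, modulo 2^32. */
static uint32_t dot4_narrow(int32_t acc, const int8_t a[4], const int8_t b[4])
{
    uint32_t r = (uint32_t)acc;

    for (int k = 0; k < 4; ++k) {
        /* Unsigned addition wraps, so no signed overflow occurs here. */
        r += (uint32_t)((int32_t)a[k] * (int32_t)b[k]);
    }
    return r;
}

/* Hypothetical self-check: both formulations must give the same bits. */
static uint32_t dot4_checked(int32_t acc, const int8_t a[4], const int8_t b[4])
{
    uint32_t narrow = dot4_narrow(acc, a, b);

    assert(narrow == dot4_wide(acc, a, b));
    return narrow;
}
```

With acc = INT32_MAX and all eight inputs equal to 127, both helpers return
0x8000FC03, i.e. the sum wraps past INT32_MAX into the negative range. Under
these assumptions the low 32 bits of the wide sum and the wrapped 32-bit sum
are the same value, so the two shapes differ only in how the wrap is spelled.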