https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88510

--- Comment #4 from Devin Hussey <husseydevin at gmail dot com> ---
I have decided to refer to goodmul as ssemul from now on; I think it is a better
name.

I am also wondering whether AArch64 gets a benefit from this vs. scalarizing when
the value is already in a NEON register. I don't have an AArch64 device to test on.
For reference, I use an LG G3 with a Snapdragon 801 (Cortex-A15) underclocked to
4 cores @ 1.7 GHz.

I also did some testing, and twomul is also the fastest when only one value can be
interleaved outside of the loop (e.g. a constant). ssemul is only the fastest when
either both operands can be interleaved beforehand, or the high or low bits of an
operand are known to be zero, in which case the sequence can be simplified.
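
For illustration, here is the ssemul pattern pulled out into a standalone helper.
This is only a sketch using the standard NEON types (uint64x2_t, uint32x2x2_t,
which my U64x2/U32x2x2 aliases map to), and it assumes both operands are already
in interleaved {lo, hi} form, e.g. from vld2_u32:

#include <arm_neon.h>

// ssemul: low 64 bits of a 64x64 multiply in each lane, with both operands
// pre-interleaved as {lo, hi} pairs (e.g. loaded with vld2_u32).
// result = aLo*bLo + ((aLo*bHi + aHi*bLo) << 32)
static inline uint64x2_t ssemul(uint32x2x2_t a, uint32x2x2_t b)
{
    uint64x2_t cross = vmull_u32(a.val[0], b.val[1]);   // aLo * bHi
    cross = vmlal_u32(cross, a.val[1], b.val[0]);       // + aHi * bLo
    uint64x2_t prod  = vshlq_n_u64(cross, 32);          // << 32
    return vmlal_u32(prod, a.val[0], b.val[0]);         // + aLo * bLo
}

When, say, the high 32 bits of b are known to be zero, the first vmull_u32 drops
out and the whole thing shrinks to a vmull, a shift, and a vmlal.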

For example, the xxHash64 routine, which looks like this:

const U8 *p;
const U8 *limit = p + len - 31;
U64x2 v[2];
...
do {
    // Actually unrolled
    for (int i = 0; i < 2; i++) {
        // Load (U8 load because alignment is dumb)
        U64x2 inp = vreinterpretq_u64_u8(vld1q_u8(p));
        p += 16;
        v[i] += inp * PRIME64_2;
        v[i]  = (v[i] << 31) | (v[i] >> (64 - 31));
        v[i] *= PRIME64_1;
    }
} while (p < limit);

seems to be the fastest when implemented like this:


// Wordswap and separate low bits for twomul
const U64x2 prime1Base = vdupq_n_u64(PRIME64_1);
const U32x2 prime1Lo = vmovn_u64(prime1Base);
const U32x4 prime1Rev = vrev64q_u32(vreinterpretq_u32_u64(prime1Base));

// Interleave for ssemul
_Alignas(16) const U64 PRIME2[2] = { PRIME64_2, PRIME64_2 };
const U32x2x2 prime2 = vld2_u32((const U32 *)__builtin_assume_aligned(PRIME2, 16));

U64x2 v[2];
do {
    // actually unrolled
    for (int i = 0; i < 2; i++) {
        // Interleaved load
        U32x2x2 inp = vld2_u32((const U32 *)p);
        p += 16;

        // ssemul
        // val = (U64x2)inpLo * (U64x2)prime2Hi;
        U64x2 val = vmull_u32(inp.val[0], prime2.val[1]);

        // val += (U64x2)inpHi * (U64x2)prime2Lo;
        val = vmlal_u32(val, inp.val[1], prime2.val[0]);

        // val <<= 32;
        val = vshlq_n_u64(val, 32);

        // val += (U64x2)inpLo * (U64x2)prime2Lo;
        val = vmlal_u32(val, inp.val[0], prime2.val[0]);
        // end ssemul

        // Add
        v[i] = vaddq_u64(v[i], val);

        // Rotate left
        v[i] = vsriq_n_u64(vshlq_n_u64(v[i], 31), v[i], 33);

        // twomul
        // topLo = v[i] & 0xFFFFFFFF;
        U32x2 topLo = vmovn_u64(v[i]);

        // top = (U32x4)v[i];
        U32x4 top = vreinterpretq_u32_u64(v[i]);

        // prod = {
        //   topLo * prime1Hi,
        //   topHi * prime1Lo
        // };
        U32x4 prod = vmulq_u32(top, prime1Rev);

        // prod64 = (U64x2)prod[0] + (U64x2)prod[1];
        U64x2 prod64 = vpaddlq_u32(prod);

        // prod64 <<= 32;
        prod64 = vshlq_n_u64(prod64, 32);

        // prod64 += (U64x2)topLo * (U64x2)prime1Lo;
        prod64 = vmlal_u32(prod64, topLo, prime1Lo);
        // end twomul

        // Store back into v[i] (the "v[i] *= PRIME64_1" step)
        v[i] = prod64;
    }
} while (p < limit);

As you can see, since we can do an interleaved load on p, it is fastest to multiply
it with ssemul. However, since v is used for more than just multiplication, it stays
in normal lane order and we use twomul for it.
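
For completeness, here is the twomul pattern as a standalone helper, again only a
sketch with the standard NEON types. It multiplies a U64x2 in normal lane order by
a constant whose word-swapped and low halves were prepared outside the loop, like
prime1Rev/prime1Lo above:

#include <arm_neon.h>

// twomul: multiply each 64-bit lane of a (normal lane order) by a constant b,
// where the caller hoists these out of the loop:
//   bRev = vrev64q_u32(vreinterpretq_u32_u64(vdupq_n_u64(b)))  // {bHi, bLo} per lane
//   bLo  = vmovn_u64(vdupq_n_u64(b))                           // low 32 bits per lane
// result = aLo*bLo + ((aLo*bHi + aHi*bLo) << 32)
static inline uint64x2_t twomul(uint64x2_t a, uint32x4_t bRev, uint32x2_t bLo)
{
    uint32x2_t aLo   = vmovn_u64(a);                     // low 32 bits of each lane
    uint32x4_t a32   = vreinterpretq_u32_u64(a);         // view a as 32-bit lanes
    uint32x4_t prod  = vmulq_u32(a32, bRev);             // {aLo*bHi, aHi*bLo} per lane
    uint64x2_t cross = vpaddlq_u32(prod);                // aLo*bHi + aHi*bLo
    uint64x2_t hi    = vshlq_n_u64(cross, 32);           // << 32
    return vmlal_u32(hi, aLo, bLo);                      // + aLo * bLo
}

With that helper, the multiply in the loop above would be
v[i] = twomul(v[i], prime1Rev, prime1Lo);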

On my G3 in Termux with the xxhsum 100 KB benchmark, this reaches 2.65 GB/s,
compared to 0.8 GB/s for scalar and 2.24 GB/s when both multiplies use ssemul.
However, this was compiled with Clang. For some reason, even though I see no
major differences in the assembly, GCC consistently produces code that runs at
roughly 80% of Clang's speed. But this is mostly about the algorithm, so that
difference isn't important here.

Considering that this is 64-bit arithmetic on a 32-bit device, that is pretty
good.
