[email protected] (Niels Möller) writes:

> So if we have the input in register A (loaded from memory with no
> processing besides ensuring proper *byte* order), and precompute two
> values, M representing b_1(x) x^64 + c_1(x), and L representing b_0(x)
> x^64 + d_1(x), then we get the two halves above with two vpmsumd,
>
>   vpmsumd R, M, A
>   vpmsumd F, L, A
>
> When doing more than one block at a time, I think it's easiest to
> accumulate the R and F values separately.

BTW, I wonder if a similar organization would make sense for Arm Neon.
Neon doesn't have vpmsumd; the widest carryless multiplication available
is vmull.p8, which multiplies 8-bit inputs into 15-bit products, 8 in
parallel.
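
To make the lane arithmetic concrete, here is a minimal scalar model in
C of what I take vmull.p8 to compute. The names pmul8 and vmull_p8_model
are made up for illustration, and this is only a sketch of the
arithmetic, not Neon code.

```c
#include <stdint.h>

/* Carry-less (polynomial over GF(2)) product of two 8-bit values.
   The product has degree at most 14, so it fits in 15 bits. */
static uint16_t
pmul8 (uint8_t a, uint8_t b)
{
  uint16_t r = 0;
  for (int i = 0; i < 8; i++)
    if ((b >> i) & 1)
      r ^= (uint16_t) (a << i);   /* xor instead of add: no carries */
  return r;
}

/* Hypothetical scalar model of vmull.p8: eight independent lane-wise
   products, each 8-bit by 8-bit, stored in 16-bit result lanes. */
static void
vmull_p8_model (uint16_t r[8], const uint8_t a[8], const uint8_t b[8])
{
  for (int i = 0; i < 8; i++)
    r[i] = pmul8 (a[i], b[i]);
}
```

The top bit of each 16-bit result lane is always zero, which is why
"8-bit to 15-bit" is the accurate description.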

I'm sketching an instruction sequence doing the equivalent of two
vpmsumd using 32 vmull.p8, with good parallelism and not too many
instructions to shuffle data around to the right places. Is that a good
idea? It should be compared to what the C code does: a loop of 16
iterations, each doing some table lookup, shifting and xoring.

With this large number of multiply instructions, it might pay off to use
Karatsuba, which could reduce it to 24 multiplies (one level) or 18 (two
levels), at the cost of more xors and data movement instructions, and
lots of complexity.
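
The multiply counts can be checked with a small scalar model in C. This
is only a sketch for counting; the names (u128, pmul8, pmul, pmul64,
clmul64_ref, count_byte_products) are hypothetical, it is not a proposed
Neon implementation, and it ignores the extra xors and data movement
that Karatsuba needs.

```c
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128;   /* hypothetical helper type */

static unsigned n_byte_products;            /* counts calls to pmul8 */

/* One lane of a vmull.p8-style multiply: 8-bit by 8-bit, carry-less. */
static uint64_t
pmul8 (uint32_t a, uint32_t b)
{
  uint64_t r = 0;
  for (unsigned i = 0; i < 8; i++)
    if ((b >> i) & 1)
      r ^= (uint64_t) a << i;
  n_byte_products++;
  return r;
}

/* Carry-less product of two n-bit values (n = 16 or 32), splitting with
   Karatsuba `levels` times, then schoolbook on bytes below that. */
static uint64_t
pmul (uint32_t a, uint32_t b, unsigned n, unsigned levels)
{
  if (levels == 0)
    {
      uint64_t r = 0;
      for (unsigned i = 0; i < n; i += 8)
        for (unsigned j = 0; j < n; j += 8)
          r ^= pmul8 ((a >> i) & 0xff, (b >> j) & 0xff) << (i + j);
      return r;
    }
  unsigned h = n / 2;
  uint32_t mask = ((uint32_t) 1 << h) - 1;
  uint32_t a0 = a & mask, a1 = a >> h, b0 = b & mask, b1 = b >> h;
  uint64_t z0 = pmul (a0, b0, h, levels - 1);
  uint64_t z2 = pmul (a1, b1, h, levels - 1);
  /* Karatsuba: the middle term needs one product instead of two. */
  uint64_t z1 = pmul (a0 ^ a1, b0 ^ b1, h, levels - 1) ^ z0 ^ z2;
  return z0 ^ (z1 << h) ^ (z2 << n);
}

/* 64x64 -> 128 carry-less product with 0, 1 or 2 Karatsuba levels. */
static u128
pmul64 (uint64_t a, uint64_t b, unsigned levels)
{
  uint32_t a0 = (uint32_t) a, a1 = (uint32_t) (a >> 32);
  uint32_t b0 = (uint32_t) b, b1 = (uint32_t) (b >> 32);
  uint64_t z0, z1, z2;
  if (levels == 0)
    {
      z0 = pmul (a0, b0, 32, 0);
      z2 = pmul (a1, b1, 32, 0);
      z1 = pmul (a0, b1, 32, 0) ^ pmul (a1, b0, 32, 0);
    }
  else
    {
      z0 = pmul (a0, b0, 32, levels - 1);
      z2 = pmul (a1, b1, 32, levels - 1);
      z1 = pmul (a0 ^ a1, b0 ^ b1, 32, levels - 1) ^ z0 ^ z2;
    }
  u128 r = { z0 ^ (z1 << 32), z2 ^ (z1 >> 32) };
  return r;
}

/* Bit-by-bit reference, used only to check the results above. */
static u128
clmul64_ref (uint64_t a, uint64_t b)
{
  u128 r = { 0, 0 };
  for (unsigned i = 0; i < 64; i++)
    if ((b >> i) & 1)
      {
        r.lo ^= a << i;
        if (i > 0)
          r.hi ^= a >> (64 - i);
      }
  return r;
}

/* Byte products used for one 64x64 product at a given Karatsuba depth. */
static unsigned
count_byte_products (unsigned levels)
{
  n_byte_products = 0;
  (void) pmul64 (0x0123456789abcdefULL, 0xfedcba9876543210ULL, levels);
  return n_byte_products;
}
```

In this model one 64x64 product takes 64, 48 or 36 byte products for
zero, one or two Karatsuba levels. Two vpmsumd amount to four 64x64
products, and one vmull.p8 does 8 byte products, so assuming the lanes
can always be filled, that is 4*64/8 = 32, 4*48/8 = 24 and 4*36/8 = 18
vmull.p8.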

(There has been ARM Neon code for gcm posted to the list earlier, but if I
remember correctly, that code didn't work in the bit-reversed
representation; it used a bunch of explicit reversal operations.)

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
_______________________________________________
nettle-bugs mailing list
[email protected]
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
