[email protected] (Niels Möller) writes:
> Maamoun TK <[email protected]> writes:
>
>> I would like to explain more about 'vpmsumd' instruction, in x86 arch the
>> 'pclmulqdq' instruction is used for carry-less operations. To use
>> 'pclmulqdq' an immediate value should be passed to the third parameter of
>> the instruction to specify which doublewords will be multiplied. However,
>> 'vpmsumd' do the following operation:
>> (High-order doubleword of the second parameter * High-order doubleword of
>> the third parameter) XOR (Low-order doubleword of the second parameter *
>> Low-order doubleword of the third parameter)
>
> Interesting! Do you use inputs where one doubleword is zero, making one
> or the other of the xored values be zero, or is there some clever way to
> take advantage of the builtin wraparound? I guess one can also do some
> interesting things if other selected parts of the inputs are zero, for
> example, the middle word of one of the operands, or all odd-numbered
> bits, or...
Ok, let me see if I can get the math right first (and ignore the issues
of bit order and reversal). Thanks a lot for the explanation of
this curious instruction; I hope I have understood it correctly, because
it seems really useful.
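To make sure I read the description right, here is a tiny Python model of
what I understand vpmsumd to compute (the names clmul and vpmsumd are just
mine, for illustration, and this is of course nothing like the actual
assembly). Zeroing one doubleword of an operand isolates a single 64x64
product:

```python
def clmul(a, b):
    """Carry-less multiply of two 64-bit operands in GF(2)[x]."""
    r = 0
    for i in range(64):
        if (b >> i) & 1:
            r ^= a << i
    return r

def vpmsumd(x, y):
    """Model of vpmsumd on 128-bit register values:
    (high doubleword product) XOR (low doubleword product)."""
    m = (1 << 64) - 1
    return clmul(x >> 64, y >> 64) ^ clmul(x & m, y & m)

a, b = 0x1234, 0xf00d
# With the high doublewords zero, only the low product survives.
assert vpmsumd(a, b) == clmul(a, b)
# With the low doublewords zero, only the high product survives.
assert vpmsumd(a << 64, b << 64) == clmul(a, b)
```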
We have the polynomial p(x) = x^128 + x^7 + x^2 + x + 1 over Z_2, which
implies that
x^128 = x^7 + x^2 + x + 1 (mod p(x))
so let's call this simpler polynomial p'(x) = x^7 + x^2 + x + 1.
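A quick check of that reduction identity, in the same kind of throwaway
Python (polymod is my own helper, treating an integer's bits as polynomial
coefficients over Z_2):

```python
P = (1 << 128) | 0x87   # p(x) = x^128 + x^7 + x^2 + x + 1
PP = 0x87               # p'(x) = x^7 + x^2 + x + 1

def polymod(a, p=P):
    """Remainder of a polynomial over Z_2 modulo p (bits = coefficients)."""
    while a.bit_length() > p.bit_length() - 1:
        a ^= p << (a.bit_length() - p.bit_length())
    return a

assert polymod(1 << 128) == PP   # x^128 = p'(x)  (mod p(x))
assert polymod(P) == 0
```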
As input we have two polynomials, one depending on the key (i.e.,
invariant when processing a large message), represented as 128 bits
each. Split them into 64-bit pieces we can reason about:
A = a_1(x) x^64 + a_0(x), B = b_1(x) x^64 + b_0(x),
with (unreduced) product C(x) = A(x) B(x), split in 64-bit pieces as
C(x) = c_3(x) x^192 + c_2(x) x^128 + c_1(x) x^64 + c_0 (x)
(where c_3 actually is only 63 bits). To reduce, we use x^128 = p'(x)
(mod p(x)). We'd then need to multiply the high half with p'(x), say
D = (c_3(x) x^64 + c_2(x)) p'(x) = d_2(x) x^128 + d_1(x) x^64 + d_0(x)
where we get the high few bits in d_2(x). So we'd need to compute d_2
p'(x) again. And finally, we get the result C(x) mod p by summing/xoring
c_1(x) x^64 + c_0(x)
d_1(x) x^64 + d_0(x)
d_2(x) p'(x)
--------------------------
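This two-step folding can be checked against direct reduction in plain
Python (illustrative only; clmul_wide, polymod and fold_reduce are my own
names, with wide carry-less products written out with shifts):

```python
MASK128 = (1 << 128) - 1
P = (1 << 128) | 0x87   # p(x)
PP = 0x87               # p'(x)

def clmul_wide(a, b):
    """Carry-less product of arbitrary-width operands."""
    r = 0
    for i in range(b.bit_length()):
        if (b >> i) & 1:
            r ^= a << i
    return r

def polymod(a):
    """Direct reduction modulo p(x)."""
    while a.bit_length() > 128:
        a ^= P << (a.bit_length() - 129)
    return a

def fold_reduce(c):
    """Reduce a 256-bit product: fold the high half with p'(x), then
    fold the few leftover bits d_2 with p'(x) once more."""
    lo, hi = c & MASK128, c >> 128
    d = clmul_wide(hi, PP)   # (c_3 x^64 + c_2) p'(x) = d_2 x^128 + d_1 x^64 + d_0
    return lo ^ (d & MASK128) ^ clmul_wide(d >> 128, PP)

c = clmul_wide(0x0123456789abcdef_fedcba9876543210,
               0x0f0e0d0c0b0a0908_0706050403020100)
assert fold_reduce(c) == polymod(c)
```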
Now, let's look closer at the initial multiplication, A(x) B(x), which
can be expanded as
a_1(x) b_1(x) x^128 + (a_1(x) b_0(x) + a_0(x) b_1(x)) x^64 + a_0(x) b_0(x)
I see two tricks: First, we can precompute
a'_2(x) x^64 + a'_1(x) = a_1(x) x^128 (mod p(x)),
and replace the first term with
(a'_2(x) x^64 + a'_1(x)) b_1(x)
That should eliminate the annoying high bits d_2 in the reduction, at
the cost of one more 64x64 multiply. I would expect that's the same
number of operations, but a more shallow dependency path. We get a 192-bit
partially reduced result, instead of the full 256-bit result. Let's
denote this product as
h_2(x) x^128 + h_1(x) x^64 + h_0(x) = (a'_2(x) x^64 + a'_1(x)) b_1(x)
Second, if we also prepare a register with the swapped A, a_0(x) x^64 +
a_1(x), then the vpmsumd instruction, if I understand you correctly, can
compute the middle term (a_1(x) b_0(x) + a_0(x) b_1(x)) in one
operation. So we get down to 3 64x64 multiplies for free, without the
reconstruction needed for standard Karatsuba multiplication. So if we
compute the middle term like this,
m_1(x) x^64 + m_0(x) = a_1(x) b_0(x) + a_0(x) b_1(x)
then the part we need to fold is m_1(x) + h_2(x),
and we get the fully reduced result by summing
a_0(x) b_0(x) (127 bits, one vpmsumd with one input half zero)
m_0(x) x^64 (the 64-bit low half of the middle term, shifted)
h_1(x) x^64 + h_0(x) (128 bits)
(m_1(x) + h_2(x)) p'(x) (70 or so bits)
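Putting it all together in the same throwaway Python model (gf128_mul and
all the variable names are mine), the scheme agrees with direct reduction
of the full 256-bit product:

```python
M64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1
P = (1 << 128) | 0x87
PP = 0x87

def clmul(a, b):
    """Carry-less product, arbitrary widths."""
    r = 0
    for i in range(b.bit_length()):
        if (b >> i) & 1:
            r ^= a << i
    return r

def polymod(a):
    while a.bit_length() > 128:
        a ^= P << (a.bit_length() - 129)
    return a

def gf128_mul(A, B):
    a1, a0 = A >> 64, A & M64
    b1, b0 = B >> 64, B & M64
    # Precomputed from the key half a_1: a' = a_1 x^128 mod p = a_1 p'(x)
    ap = clmul(a1, PP)
    ap2, ap1 = ap >> 64, ap & M64
    # (a'_2 x^64 + a'_1) b_1 = h_2 x^128 + h_1 x^64 + h_0  (two multiplies)
    h = (clmul(ap2, b1) << 64) ^ clmul(ap1, b1)
    # Middle term, one vpmsumd of the swapped A with B: a_1 b_0 + a_0 b_1
    m = clmul(a1, b0) ^ clmul(a0, b1)
    m1, m0 = m >> 64, m & M64
    # Sum the four parts; m_1 + h_2 is folded once with p'(x)
    return clmul(a0, b0) ^ (m0 << 64) ^ (h & MASK128) ^ clmul(m1 ^ (h >> 128), PP)

A = 0x0123456789abcdef_fedcba9876543210
B = 0x0f1e2d3c4b5a6978_8796a5b4c3d2e1f0
assert gf128_mul(A, B) == polymod(clmul(A, B))
```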
If I count it correctly, we get the fully reduced result with 5 vpmsumd
multiplications: one for a_0 b_0, one for the middle terms, two for the
multiplication with a', and one for the only explicit multiplication
with p'(x). And if we do multiple blocks in parallel, they can share
this final multiply.
There are surely further tricks, e.g., it seems the a_0(x) b_0(x) multiply
can share a vpmsumd instruction with the a'_1(x) b_1(x) multiply,
if we just prepare a register with a'_1(x) x^64 + a_0(x); then the
replacement of a_1 with the twice as large a' values comes for free. One
could also attempt to let operations corresponding to separate input
blocks share vpmsumd instructions.
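That sharing idea can be checked with the same modeled vpmsumd as before
(clmul64/vpmsumd are my own names): one instruction on the combined
registers yields a'_1 b_1 XOR a_0 b_0, which is exactly the combination
that appears in the final sum anyway.

```python
def clmul64(a, b):
    """Carry-less multiply of two 64-bit operands."""
    r = 0
    for i in range(64):
        if (b >> i) & 1:
            r ^= a << i
    return r

def vpmsumd(x, y):
    """Model of vpmsumd: high*high XOR low*low."""
    m = (1 << 64) - 1
    return clmul64(x >> 64, y >> 64) ^ clmul64(x & m, y & m)

a0, a1p = 0x1111222233334444, 0x5555666677778888   # a_0 and a'_1
b0, b1 = 0x9999aaaabbbbcccc, 0xddddeeeeffff0000
# One instruction on (a'_1 : a_0) and (b_1 : b_0) gives both products XORed:
shared = vpmsumd((a1p << 64) | a0, (b1 << 64) | b0)
assert shared == clmul64(a1p, b1) ^ clmul64(a0, b0)
```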
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.