Re: Fwd: [Arm64, PowerPC64, S390x] Optimize Poly1305

Niels Möller Wed, 19 Oct 2022 04:15:18 -0700

ni...@lysator.liu.se (Niels Möller) writes:

> Ideally, I'd like to have something like the below:
>
>   uint64_t l0, l1, l2, h0, h1, h1;
>  
>   l0 = p0 & 0xfffffffffff; h0 = p0 >> 44;
>   l1 = p1 & 0xfffffffffff; h1 = p1 >> 44;
>   l2 = p2 & 0x3ffffffffff; h2 = p2 >> 42;
>
>   ctx->hh0 = l0 + 5*h2;
>   ctx->hh1 = l1 + h0;
>   ctx->hh2 = l2 + h1;
>  
> But likely won't work, since bound on p0, p1, p2 is 3 * 2^88 plus any
> excess bits we had on the input state, and therefore number of excess
> bits will grow with each iteration (but we could possibly do this for a few
> iterations before reducing further).


Let me try to quantify this. Say the input state are numbers < a 2^44.
Then we get 

  p0, p1, p2 < 3a 2^88

Next,

  h0, h1 < 3a 2^44
  h2 < 3a 2^46

The largest output is then the new hh0,

  hh0 < 2^44 + 5 (3a 2^46) = (60 a + 1) 2^44 

So the state bound grows by almost 6 bits (could probably be sligthly
less with a more careful analysis).

So say we start with a = 2 (state numbers < 2^45), then after three
iterations they have grown to 63 bits, and we'd have to do a round of
additional folding to get them back to < 2^45.

Might be used instead of, or in addition to, doing multiple blocks using
additional precomputed powers of the key, there are many options. 

E.g., precompute one k^2, do two blocks at a time (each pk is then the
accumulation of 6 products rather than three, so one bit larger than
above). Do the half way reduction as above, then another two blocks, and
the more expensive reduction back to 45 bits. 

So a loop doing four blocks per iteration using only one additional
precomputed key power. With more careful analysis, could perhaps be
extended to 6 blocks.

Looking at dependent operations, I estimate that the simple reduction
gives a four cycles dependency. To hide some of that, I think the
computation of the p0, p1, p2 should be ordered so that h0 is used last
of the inputs, and p2 is ready first (i.e., first compute and accumulate
the products involving h1, h2 inputs, then compute the h0 product that
is needed to complete into p2, then the other two products involving h0.

Minimum latency from the dependencies of a single-block loop would then
be 5-6 cycles plus the latency for a full multiplication. It's rather
complex, and to get close to full utilizaton of multipliers, I imagine
the latency for the state word with the cheapest reduction matters too.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
_______________________________________________
nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se

Re: Fwd: [Arm64, PowerPC64, S390x] Optimize Poly1305

Reply via email to