Maamoun TK <[email protected]> writes:

> I think the design could be as simple as always padding each block with
> 0x01 in _nettle_poly1305_update, while _nettle_poly1305_block, which is
> responsible for processing the last block, takes a variable padding value
> (0 or 1).

I agree that should work. In some sense it would be simpler with a
single entry point, but what you propose is likely less complexity.

> I've implemented Poly1305 based on 2^44 in C and got the following
> benchmark results of update function tested on Core i5-10300H
> *---------------------------------------------------*
> |  Radix                                |  Mbyte/s  |
> |---------------------------------------|-----------|
> |  2^26  (C)                            |  1397.53  |
> |  2^44  (C)                            |  1734.09  |
> |  2^64  (C)                            |  2645.74  |
> |  2^64  (C, deferred partial folding)  |  3048.93  |
> |  2^64  (Assembly - latest)            |  4001.14  |
> *---------------------------------------------------*
> I used two versions of 2^64 based C-implementations you shared earlier for
> this benchmark. As it stands, Poly1305 based on 2^64 with deferred partial
> folding performs better compared with other C implementations for single
> block processing.

Interesting! I wonder if it makes sense to have several implementations
in poly1305-internal.c under some ifdefs. Could be useful (i) as
reference, and (ii) to easily switch if, e.g., it turns out that it is
beneficial to use different variants depending on platform.
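Such a selection could be as simple as a compile-time switch in poly1305-internal.c. A sketch, with POLY1305_RADIX as a hypothetical macro name, not an existing nettle configure option:

```c
/* Hypothetical organization; POLY1305_RADIX is an assumed macro,
   not an existing nettle configure option. */
#ifndef POLY1305_RADIX
# define POLY1305_RADIX 64
#endif

#if POLY1305_RADIX == 26
/* reference radix-2^26 implementation */
#elif POLY1305_RADIX == 44
/* radix-2^44 implementation */
#elif POLY1305_RADIX == 64
/* radix-2^64 implementation, deferred partial folding */
#else
# error "Unsupported POLY1305_RADIX"
#endif
```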

> +  p1 += p0 >> 44;
> +  p2 += p1 >> 44;
> +  ctx->hh2 = 0x3ffffffffff & p2;
> +  p2 >>= 42;
> +  p0 = (p0 & 0xfffffffffff) + (p2 << 2) + p2;
> +  ctx->hh0 = 0xfffffffffff & p0;
> +  ctx->hh1 = (0xfffffffffff & p1) + (p0 >> 44);

To get the highest speed of this reduction, I think one should keep two ideas 
in mind:

1. It's not necessary to have a unique and fully reduced representation,
   one can allow hh0, hh1 and hh2 to be a bit or two larger than 44 bits.

2. Avoid long dependency chains, since dependent instructions can't run
   in parallel.

Let me think aloud...

Ideally, I'd like to have something like the below:

  uint64_t l0, l1, l2, h0, h1, h2;
 
  l0 = p0 & 0xfffffffffff; h0 = p0 >> 44;
  l1 = p1 & 0xfffffffffff; h1 = p1 >> 44;
  l2 = p2 & 0x3ffffffffff; h2 = p2 >> 42;

  ctx->hh0 = l0 + 5*h2;
  ctx->hh1 = l1 + h0;
  ctx->hh2 = l2 + h1;
 
But that likely won't work, since the bound on p0, p1, p2 is 3 * 2^88
plus any excess bits we had in the input state, so the number of excess
bits would grow with each iteration (though we could possibly do this for
a few iterations before reducing further). So we probably need to fold
the excess bits of some or all of p0, p1, p2. If we do it for all of
them, that would be

  uint64_t l0, l1, l2, h0, h1, h2, t0, t1, t2;
 
  l0 = p0 & 0xfffffffffff; h0 = (p0 >> 44) & 0xfffffffffff; t0 = p0 >> 88;
  l1 = p1 & 0xfffffffffff; h1 = (p1 >> 44) & 0x3ffffffffff; t1 = p1 >> 86;
  l2 = p2 & 0x3ffffffffff; h2 = (p2 >> 42);

  h2 = 5 * (h2 + t1);
  t2 = h2 >> 44; h2 &= 0xfffffffffff;

  ctx->hh0 = l0 + h2;
  ctx->hh1 = l1 + h0 + t2;
  ctx->hh2 = l2 + h1 + t0;

The deepest dependency is for the pieces that are folded via the
multiplication by 5. So maybe extract those bits as early as
possible, and do a plain addition chain to fold the low part. Then we
get

  uint64_t t, l2, l1, l0;
  /* Extract high bits to be folded */
  t = (p2 >> 42) + (p1 >> 86);
  /* The & with 128-bit constant should be just & 0x3fffff
     applied to high half of p1. */
  l2 = p2 & 0x3ffffffffff; p1 &= (~(unsigned __int128) 0) >> 42;

  /* Fold low parts. */
  l0 = p0 & 0xfffffffffff; p1 += p0 >> 44;
  l1 = p1 & 0xfffffffffff; l2 += p1 >> 44;
  
  t *= 5; /* Could be a lea instruction on x86_64 */
  /* No carry propagation */
  l0 += (t & 0xfffffffffff);
  l1 += t >> 44; 

There's a cost to this, in that we have to split both p0 and t into high
and low parts; if we added in p0 += t first we'd only need to split
once. But this organization should provide lots of parallelism.
Intention is that the critical path should be the path involving the
multiply by 5. That path consists of
 
1. Two parallel shifts of p2 and (high half of) p1.
2. Add the shifted values together (64-bit add).
3. The multiply (shift + add).
4. Parallel shift + and.
5. Two parallel additions to l0 and l1.

The dependent addition chain p0 -> p1 -> l2 could hopefully be evaluated
in parallel, so that it's all completed in 5-6 cycles. Not sure if there
will be any gain in practice; the reduction you use seems to have a
dependency chain that's just one or two cycles deeper.

With the reduction above, I think the state variables will be reduced to
< 2^45, which must be reduced canonically as part of the digest
processing.

Regards,
/Niels
-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
_______________________________________________
nettle-bugs mailing list -- [email protected]
To unsubscribe send an email to [email protected]