Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

Maamoun TK Thu, 13 Oct 2022 04:55:53 -0700

It seems the Debian release cycle has recently been about 2 years per
version (https://wiki.debian.org/DebianReleases), so I pushed an MR that
enables testing power9-specific code
https://git.lysator.liu.se/nettle/nettle/-/merge_requests/53 since it's too
early to expect qemu v7+ in a stable release.


Also, I think we're set to proceed with the Poly1305 multi-block patch
based on radix 2^44 for PowerPC
https://git.lysator.liu.se/nettle/nettle/-/merge_requests/48 so that the
new layout for processing multiple blocks can be approved.

regards,
Mamone

On Sun, May 29, 2022 at 4:17 AM Maamoun TK <maamoun...@googlemail.com>
wrote:

> On Sat, May 14, 2022 at 8:07 PM Niels Möller <ni...@lysator.liu.se> wrote:
>
>> Maamoun TK <maamoun...@googlemail.com> writes:
>>
>> >  I created merge requests that improve Poly1305 for the arm64,
>> > powerpc64, and s390x architectures by using two-way interleaving.
>> > https://git.lysator.liu.se/nettle/nettle/-/merge_requests/38
>> > https://git.lysator.liu.se/nettle/nettle/-/merge_requests/39
>> > https://git.lysator.liu.se/nettle/nettle/-/merge_requests/41
>> > The patches give a 41.88% speedup for arm64, a 142.95% speedup for
>> > powerpc64, and a 382.65% speedup for s390x.
>>
>> I've had a closer look at the ppc merge request #39.
>>
>> I think it would be good to do the single block radix 2^44 version first
>> (I'm assuming that in itself is an improvement over the C code, and
>> over using radix 2^64?).
>
>
> I agree we should go with single block first. It seems radix 2^44 is
> slower than the C code for single-block updates, whereas radix 2^64 is
> faster. On POWER9 I measured an update speed of 657 Mbyte/s for radix
> 2^64, versus 470 Mbyte/s for the C code (radix 2^26). With that said,
> I've pushed a patch with a radix 2^64 implementation of single-block
> update, with fat build support:
> https://git.lysator.liu.se/nettle/nettle/-/merge_requests/47
>
>
>> Is 44 bit pieces ideal (130 = 44+44+42), or
>> would anything get simpler with, e.g., 130 = 48 + 48 + 34, or 130 = 56 +
>> 56 + 18?
>>
>
> For multi-block processing, it seems to me 44-bit pieces are ideal. With
> B = 2^56, I'm trying to figure out how to calculate
> B^3 R_1 = 5 * 2^38 * R_1 (mod 2^130 - 5), which doesn't fit in 64 bits
> since R_1 has 56 bits. For 2^48, I don't see any difference from 2^44 in
> the equations for the multiplication and reduction phases.
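
The size argument above can be checked with a few lines of Python. This
is only a sketch under stated assumptions: three limbs of equal width,
plain Python integers rather than machine words, and hypothetical helper
names (wrap_factor, wrap_term_bits) that do not appear in nettle. It
computes what B^3 reduces to modulo 2^130 - 5 and how many bits the
product with a full-width limb R_1 can need.

```python
P = (1 << 130) - 5  # Poly1305 prime


def wrap_factor(limb_bits):
    """Value that B^3 reduces to modulo P, for B = 2^limb_bits (3 limbs)."""
    excess = 3 * limb_bits - 130   # B^3 = 2^excess * 2^130
    return 5 << excess             # because 2^130 == 5 (mod P)


def wrap_term_bits(limb_bits):
    """Bit length of wrap_factor * R_1 for the largest possible limb R_1."""
    max_limb = (1 << limb_bits) - 1
    return (wrap_factor(limb_bits) * max_limb).bit_length()


for bits in (44, 56):
    # Sanity check: the shortcut agrees with a direct modular power.
    assert pow(2, 3 * bits, P) == wrap_factor(bits) % P
    print(bits, wrap_factor(bits), wrap_term_bits(bits))
```

With 44-bit limbs the factor is 20 and the product stays within 64 bits,
while with 56-bit limbs the factor is 5 * 2^38 and the product does not.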
>
>
>> For the 4-way code, the name and organization seems inspired by
>> chacha_4core, which is a bit different since it also has a four-block
>> output, and then the caller has to be aware. I think it would be better
>> to look at the recent ghash. Maybe one can have an internal
>> _poly1305_update, following similar conventions as _ghash_update? Then
>> the C code doesn't need to know how many blocks are done at a time,
>> which should make things a bit simpler (although the assembly code would
>> need logic to do left-over blocks, just like for ghash).
>>
>
> I agree. I've pushed a new MR
> https://git.lysator.liu.se/nettle/nettle/-/merge_requests/48 with a
> poly1305_update implementation for PowerPC that uses radix 2^44 for
> multi-block processing and radix 2^64 for single-block updates. The
> threshold lives in the assembly file; in this case it is set to 12
> blocks, since the radix 2^64 implementation is relatively fast. I like
> the structure so far. Please take a look and let me know if it fits well,
> so I can implement it for other architectures.
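
The dispatch described above might look roughly like the following
Python sketch. It is not the code in the MR: it uses plain integers
instead of radix 2^44 / 2^64 limbs, the names update, multi_block and
single_block are hypothetical, and the group size of 4 is an assumption.
Only the 12-block threshold is taken from the message. Blocks are
integers that already include the 2^128 padding bit.

```python
import random

P = (1 << 130) - 5   # Poly1305 prime
THRESHOLD = 12       # blocks; the value quoted in the message


def single_block(h, r, m):
    """Stand-in for the single-block path: absorb one block."""
    return (h + m) * r % P


def multi_block(h, r, blocks, group):
    """Stand-in for the multi-block path; len(blocks) is a multiple of group."""
    powers = [pow(r, k, P) for k in range(group, 0, -1)]  # r^group .. r^1
    for i in range(0, len(blocks), group):
        chunk = blocks[i:i + group]
        acc = h * powers[0]
        for m, rk in zip(chunk, powers):
            acc += m * rk
        h = acc % P
    return h


def update(h, r, blocks, threshold=THRESHOLD, group=4):
    """Hypothetical dispatcher: the caller never sees the group size."""
    done = 0
    if len(blocks) >= threshold:
        done = len(blocks) - len(blocks) % group
        h = multi_block(h, r, blocks[:done], group)
    for m in blocks[done:]:          # left-over blocks, or short input
        h = single_block(h, r, m)
    return h


# Check against block-by-block processing for several message lengths.
rng = random.Random(1)
r = rng.getrandbits(124)
for n in range(30):
    blocks = [rng.getrandbits(128) | (1 << 128) for _ in range(n)]
    expected = 0
    for m in blocks:
        expected = single_block(expected, r, m)
    assert update(0, r, blocks) == expected
```

Keeping the threshold and the left-over handling inside one update
routine is what lets the caller stay unaware of how many blocks are
processed at a time.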
>
>
>> > OpenSSL is still ahead in terms of performance since it uses 4-way
>> > interleaving, or maybe more!
>> > Increasing the interleaving beyond two ways has nothing to do with
>> > parallelism, since the execution units are already saturated by 2-way
>> > interleaving on all three architectures. The performance improvement
>> > comes from halving the number of times the reduction procedure runs
>> > with 4-way interleaving, since the products of the state parts and
>> > the key can be combined before the reduction phase. Let me know if
>> > you are interested in doing that in nettle!
>>
>> Good to know that 2-way is sufficient to saturate execution units. Going
>> to 4-way does have a startup cost for each call, since we don't have
>> space for extra pre-computed powers. But for large messages, we'll get
>> the best speed if we can make reduction as cheap as possible.
>>
>
> I understand that 4-way has nothing to offer regarding 'vmsumudm'
> parallelism in the multiplication phase. But, as you've mentioned,
> reducing the product once per 4 blocks improves performance, and it also
> increases the parallelism of side procedures like the R64_TO_R44_4B
> macro. Together this yields a significant improvement (approximately
> double the performance) over the 2-way implementation on powerpc.
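
The effect on the reduction count can be illustrated in Python. This is
a sketch with plain integers, not the radix 2^44 assembly, and the
function name interleaved is hypothetical. It absorbs blocks in groups,
sums the unreduced products, reduces once per group, and checks that
1-way, 2-way and 4-way processing give the same state.

```python
import random

P = (1 << 130) - 5  # Poly1305 prime


def interleaved(h, r, blocks, ways):
    """Absorb blocks `ways` at a time; return (state, number of reductions).

    len(blocks) must be a multiple of `ways`.
    """
    # Key powers r^ways .. r^1; computing them is the per-call startup cost.
    powers = [pow(r, k, P) for k in range(ways, 0, -1)]
    reductions = 0
    for i in range(0, len(blocks), ways):
        chunk = blocks[i:i + ways]
        acc = h * powers[0]
        for m, rk in zip(chunk, powers):
            acc += m * rk          # products are summed unreduced
        h = acc % P                # one reduction per group
        reductions += 1
    return h, reductions


rng = random.Random(1)
r = rng.getrandbits(124)
blocks = [rng.getrandbits(128) | (1 << 128) for _ in range(16)]

h1, n1 = interleaved(0, r, blocks, 1)
h2, n2 = interleaved(0, r, blocks, 2)
h4, n4 = interleaved(0, r, blocks, 4)
assert h1 == h2 == h4
print(n1, n2, n4)
```

For 16 blocks, the 4-way variant performs half as many reductions as
the 2-way one, at the price of computing more key powers up front.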
>
> regards,
> Mamone
>
>
>> Regards,
>> /Niels
>>
>> --
>> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
>> Internet email is subject to wholesale government surveillance.
>>
>
_______________________________________________
nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se
