It seems the Debian release cycle has recently been about 2 years per version (https://wiki.debian.org/DebianReleases), so it's too early to expect qemu v7+ in the stable release. I have therefore pushed an MR that enables testing of the power9-specific code: https://git.lysator.liu.se/nettle/nettle/-/merge_requests/53
Also, I think we're set to proceed with the Poly1305 multi-block patch for PowerPC based on radix 2^44, https://git.lysator.liu.se/nettle/nettle/-/merge_requests/48, so that the new layout for processing multiple blocks can be approved.

regards,
Mamone

On Sun, May 29, 2022 at 4:17 AM Maamoun TK <maamoun...@googlemail.com> wrote:

> On Sat, May 14, 2022 at 8:07 PM Niels Möller <ni...@lysator.liu.se> wrote:
>
>> Maamoun TK <maamoun...@googlemail.com> writes:
>>
>> > I created merge requests that improve Poly1305 for the arm64,
>> > powerpc64, and s390x architectures by using two-way interleaving.
>> > https://git.lysator.liu.se/nettle/nettle/-/merge_requests/38
>> > https://git.lysator.liu.se/nettle/nettle/-/merge_requests/39
>> > https://git.lysator.liu.se/nettle/nettle/-/merge_requests/41
>> > The patches give a 41.88% speedup for arm64, a 142.95% speedup for
>> > powerpc64, and a 382.65% speedup for s390x.
>>
>> I've had a closer look at the ppc merge request #39.
>>
>> I think it would be good to do the single block radix 2^44 version first
>> (I'm assuming that in itself is an improvement over the C code, and
>> over using radix 2^64?).
>
> I agree we should go with single block first. It seems radix 2^44 is
> slower than the C code for single-block updates, whereas radix 2^64 is
> faster. I got an update speed of 657 Mbyte/s with radix 2^64, whereas
> the C code (radix 2^26) gives 470 Mbyte/s on POWER9. With that said,
> I've pushed a patch with a radix 2^64 implementation for single-block
> update, with fat build support:
> https://git.lysator.liu.se/nettle/nettle/-/merge_requests/47
>
>> Are 44 bit pieces ideal (130 = 44 + 44 + 42), or would anything get
>> simpler with, e.g., 130 = 48 + 48 + 34, or 130 = 56 + 56 + 18?
>
> For multi-block processing, it seems to me 44 bit pieces are ideal.
> In case B = 2^56, I'm trying to figure out how to calculate
> B^3 R_1 = 5 * 2^38 * R_1 (mod 2^130 - 5), which doesn't fit in 64 bits
> since R_1 is itself 56 bits wide. For 2^48, I don't see any difference
> in the equations compared with 2^44 for the multiplication and
> reduction phases.
>
>> For the 4-way code, the name and organization seem inspired by
>> chacha_4core, which is a bit different since it also has a four-block
>> output, and then the caller has to be aware. I think it would be better
>> to look at the recent ghash. Maybe one can have an internal
>> _poly1305_update, following similar conventions as _ghash_update? Then
>> the C code doesn't need to know how many blocks are done at a time,
>> which should make things a bit simpler (although the assembly code would
>> need logic to do left-over blocks, just like for ghash).
>
> I agree. I've pushed a new MR,
> https://git.lysator.liu.se/nettle/nettle/-/merge_requests/48, with a
> poly1305_update implementation for PowerPC that uses radix 2^44 for
> multi-block processing and radix 2^64 for single-block update. The
> threshold between the two lives in the assembly file; in this case it
> is set to 12 blocks, since the radix 2^64 implementation is relatively
> fast. I like the structure so far. Please take a look and let me know
> if it fits well, so I can implement it for other architectures.
>
>> > OpenSSL is still ahead in terms of speed since it uses 4-way
>> > interleaving, or maybe more!
>> > Going beyond two-way interleaving has nothing to do with parallelism,
>> > since the execution units are already saturated by 2-way on all
>> > three architectures. The reason for the performance improvement is
>> > that the reduction procedure runs half as often with 4-way
>> > interleaving, since the products of multiplying the state parts by
>> > the key can be combined before the reduction phase. Let me know if
>> > you are interested in doing that on nettle!
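As a worked check of the point quoted just above (the products for several blocks can be combined before reducing), here is a minimal Python sketch using plain big integers instead of limbs. The function names and the sample key and block values are made up for illustration. It shows only the arithmetic identity, not the nettle or OpenSSL code.

```python
P = (1 << 130) - 5  # the Poly1305 prime


def update_sequential(h, r, blocks):
    """One multiplication and one reduction per block."""
    for m in blocks:
        h = ((h + m) * r) % P
    return h


def update_4way(h, r, blocks):
    """Four blocks per step: sum the four products, then reduce once.

    Assumes len(blocks) is a multiple of 4; a real implementation
    needs extra logic for left-over blocks.
    """
    r2 = (r * r) % P
    r3 = (r2 * r) % P
    r4 = (r3 * r) % P
    for i in range(0, len(blocks), 4):
        m0, m1, m2, m3 = blocks[i:i + 4]
        # Unreduced sum of products; a single reduction follows.
        acc = (h + m0) * r4 + m1 * r3 + m2 * r2 + m3 * r
        h = acc % P
    return h


# Arbitrary sample values (hypothetical, not from any test vector).
R = 0x0123456789ABCDEF0FEDCBA987654321
BLOCKS = [(1 << 128) | (i * 0x1111111111111111) for i in range(1, 9)]

if __name__ == "__main__":
    print(update_sequential(0, R, BLOCKS) == update_4way(0, R, BLOCKS))
```

Both functions return the same state, but the 4-way variant performs one reduction per four blocks, at the cost of computing r^2, r^3 and r^4 on each call.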
>>
>> Good to know that 2-way is sufficient to saturate execution units. Going
>> to 4-way does have a startup cost for each call, since we don't have
>> space for extra pre-computed powers. But for large messages, we'll get
>> the best speed if we can make reduction as cheap as possible.
>
> I understand that 4-way has nothing to offer regarding 'vmsumudm'
> parallelism in the multiplication phase, but as you mentioned, reducing
> the product once per 4 blocks improves performance. It also increases
> the parallelism of side procedures like the R64_TO_R44_4B macro.
> Together these yield a significant enhancement (approximately double
> the performance) over the 2-way implementation on powerpc.
>
> regards,
> Mamone
>
>> Regards,
>> /Niels
>>
>> --
>> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
>> Internet email is subject to wholesale government surveillance.
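For reference, a small Python sketch of the radix question discussed in the quoted thread: how a 130-bit value splits into three limbs, and what constant B^3 becomes modulo 2^130 - 5 for B = 2^44 versus B = 2^56. The helper names are hypothetical, and `max_product_bits` is only an upper bound for a full-width limb, not a statement about the actual assembly.

```python
P = (1 << 130) - 5  # the Poly1305 prime


def split_limbs(x, bits):
    """Split x < 2^130 into limbs of `bits`, `bits`, and the remaining bits."""
    mask = (1 << bits) - 1
    return [x & mask, (x >> bits) & mask, x >> (2 * bits)]


def join_limbs(limbs, bits):
    """Inverse of split_limbs."""
    return limbs[0] + (limbs[1] << bits) + (limbs[2] << (2 * bits))


def wrap_factor(bits):
    """B^3 mod p for B = 2^bits, i.e. the constant multiplying R_1
    when the B^3 term is folded back into the low limb."""
    return pow(2, 3 * bits, P)


def max_product_bits(bits):
    """Bit length of wrap_factor * (largest possible limb of `bits` bits)."""
    return (wrap_factor(bits) * ((1 << bits) - 1)).bit_length()


if __name__ == "__main__":
    for bits in (44, 56):
        print(bits, wrap_factor(bits), max_product_bits(bits))
```

With B = 2^44 the factor is 2^132 mod p = 20, so the scaled limb stays well below 64 bits. With B = 2^56 the factor is 5 * 2^38, and the scaled limb can need up to 97 bits.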