On Fri, Oct 28, 2022 at 1:30 PM Maamoun TK <maamoun...@googlemail.com> wrote:
> On Fri, Oct 28, 2022 at 3:58 PM David Edelsohn <dje....@gmail.com> wrote:
>
>> On Fri, Oct 28, 2022 at 7:34 AM Maamoun TK <maamoun...@googlemail.com> wrote:
>>
>>> I created a new merge request for the poly1305 multi-block implementation:
>>> https://git.lysator.liu.se/nettle/nettle/-/merge_requests/55
>>> The patch is based on radix 2^44 with linear carry addition for the
>>> reduction. I tried multiple scenarios to get the best-performing
>>> _nettle_poly1305_blocks on ppc and got interesting results (vectorized
>>> version):
>>>
>>> *------------------------------------------------*
>>> | Radix                        | Mbyte/s         |
>>> |------------------------------|-----------------|
>>> | 2^44 (Reduction 1)           | 1691.39         |
>>> | 2^44 (Reduction 2)           | 1812.81         |
>>> | 2^44 (Linear reduction)      | 2044.05         |
>>> | 2^64                         | 1002.27         |
>>> *------------------------------------------------*
>>>
>>> Reductions 1/2 refer to the two versions of the 2^44 reduction you shared
>>> earlier in this thread. While those reductions give more speedup in C
>>> (numbers below), the linear reduction still performs best in the
>>> vectorized version on ppc (numbers above):
>>>
>>> *------------------------------------------------*
>>> | Radix                        | Mbyte/s         |
>>> |------------------------------|-----------------|
>>> | 2^44 (C - Linear reduction)  | 1734.09         |
>>> | 2^44 (C - Reduction 1)       | 2035.56         |
>>> | 2^44 (C - Reduction 2)       | 2044.31         |
>>> *------------------------------------------------*
>>>
>>> I still see the best C performance with radix 2^64 and deferred partial
>>> folding, so having it as the default path, with the 2^26 implementation
>>> kept for legacy purposes behind conditional preprocessor directives,
>>> would fit well in my opinion.
>>>
>>> > ppc only offers instructions to shift a 128-bit value by octets
>>> After trying several formulas to get around this, I found that the best
>>> solution is to stick with the regular reduction and use the
>>> POWER10-specific instruction 'vsrq' (Vector Shift Right Quadword)
>>> instead of the vsro/vsrd instructions, but I wouldn't go with that for
>>> now since it's early to make use of Power ISA v3.1.
>>
>> Why is it early to use POWER10 instructions? Various BLAS libraries
>> have been updated to use POWER10 instructions.
>
> I just checked the compatibility of POWER10 instructions with our test
> setup in GitLab CI. The 'vsrq' instruction is implemented in qemu v7.1.0,
> which is good because we're able to grab that particular version from the
> Debian bullseye-backports packages. Also, POWER10 instructions are
> supported by GAS in binutils v2.35, which can be fetched normally from the
> bullseye packages, as I can see here:
> https://packages.debian.org/bullseye/binutils-powerpc64le-linux-gnu
>
> With that said, it's ok to target the POWER10 arch in nettle, but it opens
> up bigger potential than the current small improvement. As I remember, the
> POWER10 processor has additional built-in cryptography cores that may be
> useful to take advantage of in algorithms like AES and GCM. Does the Power
> cryptography team have any notes about that yet?

Hi, Mamone

Exploiting the POWER10 cryptographic accelerator would be even better. And
hopefully POWER10 systems will be available for Open Source developers
before the end of the year.

The IBM Power cryptography team has already performed a lot of planning for
the implementation of algorithms using the POWER10 cryptographic
instructions. Please reach out to our friends on the IBM Power Cryptography
team for guidance.

Thanks, David
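For reference, one block step of the radix-2^44 scheme with a linear carry
chain, as described in the quoted message above, can be sketched in plain C
roughly as follows. This follows the familiar poly1305-donna limb layout;
the limb names, masks and the 20*r precomputation are illustrative
assumptions, not code from the merge request, and loading the message block
into the radix as well as the final canonical reduction are left out:

  #include <stdint.h>

  #define MASK44 0xfffffffffffULL   /* low limbs hold 44 bits */
  #define MASK42 0x3ffffffffffULL   /* top limb holds 42 bits; 44+44+42 = 130 */

  /* One h = (h + m) * r (mod 2^130 - 5) step, with m already split into
     radix-2^44 limbs.  Since 2^130 = 5 (mod p), limb products that wrap
     past the top pick up a factor 4*5 = 20, hence the precomputed s1, s2. */
  static void
  poly1305_r44_block (uint64_t h[3], const uint64_t m[3], const uint64_t r[3])
  {
    uint64_t h0 = h[0] + m[0], h1 = h[1] + m[1], h2 = h[2] + m[2];
    uint64_t s1 = 20 * r[1], s2 = 20 * r[2];
    unsigned __int128 d0, d1, d2;
    uint64_t c;

    d0 = (unsigned __int128) h0 * r[0]
       + (unsigned __int128) h1 * s2
       + (unsigned __int128) h2 * s1;
    d1 = (unsigned __int128) h0 * r[1]
       + (unsigned __int128) h1 * r[0]
       + (unsigned __int128) h2 * s2;
    d2 = (unsigned __int128) h0 * r[2]
       + (unsigned __int128) h1 * r[1]
       + (unsigned __int128) h2 * r[0];

    /* Linear carry chain: each carry is added into the next limb in turn,
       and the carry out of the 42-bit top limb wraps around times 5.  */
    h0 = (uint64_t) d0 & MASK44;  d1 += (uint64_t) (d0 >> 44);
    h1 = (uint64_t) d1 & MASK44;  d2 += (uint64_t) (d1 >> 44);
    h2 = (uint64_t) d2 & MASK42;  c  = (uint64_t) (d2 >> 42);
    h0 += 5 * c;
    h1 += h0 >> 44;  h0 &= MASK44;

    h[0] = h0;  h[1] = h1;  h[2] = h2;
  }

The "Reduction 1/2" variants measured above differ only in how this carry
propagation is arranged; the sketch shows the linear variant.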
> regards,
> Mamone
>
>> Thanks, David
>>
>>> regards,
>>> Mamone
>>>
>>> On Tue, Oct 25, 2022 at 8:23 PM Niels Möller <ni...@lysator.liu.se> wrote:
>>>
>>> > Maamoun TK <maamoun...@googlemail.com> writes:
>>> >
>>> > > I did the benchmark on my laptop too. I got a speed of 3964.37 MB/s on
>>> > > upstream and 5054.32 MB/s benchmarking poly1305 update on the new
>>> > > branch. I wonder if the result numbers are truncated on your end,
>>> > > because that would keep the improvement in line with my test (25%).
>>> >
>>> > I did round, it was something like 2990 MB/s vs 5010 MB/s. But it's
>>> > going to be rather machine dependent. Those numbers were from an AMD
>>> > Ryzen 5 something (laptop is two years old, iirc). Now I've tried on my
>>> > intel box too (i3-5010U, much older).
>>> >
>>> > I get
>>> >            MB/s   cycles/block
>>> >   before:  1.8    70
>>> >   after:   2.4    53
>>> >
>>> > so on this machine, approx. 30% speedup.
>>> >
>>> > > I will give multiblock radix-2^64 a try on ppc to examine the result.
>>> >
>>> > I'm curious to see how that works out, if results extend beyond x86.
>>> >
>>> > > ppc only offers instructions to shift a 128-bit value by octets, which
>>> > > makes something like the following sketch more ideal
>>> >
>>> > Not sure if that helps, but the purpose of the shift operations that
>>> > need to operate on 128-bit values is to extract 64 bits in the middle, a
>>> > bit like the x86 "double precision shift" instructions shld/shrd.
>>> > Everything else can use 64-bit registers. (Though if the 128-bit
>>> > accumulation is done in vector registers, I guess one ought to keep all
>>> > data in the vector registers, also for the folding.)
>>> >
>>> > And since we have the highest bit always zero, it may also work to round
>>> > the shift count down (to a multiple of 8), extract the low 64 bits, and
>>> > follow with a shift of a few more bits operating only on the 64-bit
>>> > value. Something like
>>> >
>>> >   unsigned __int128 p = ...;
>>> >   uint64_t t = p >> 40;
>>> >   t >>= 2;  /* or >>= 4, if we want a 44-bit shift rather than 42 */
>>> >
>>> > Should be correct provided that our bounds guarantee that
>>> >
>>> >   p < 2^{40 + 64}.
>>> >
>>> > > l0 = p0 & 0x3ffffffffff; h0 = (p0 >> 40) ...
>>> > >
>>> > > I've thought of radix 2^48, but the fact that r^2_1' = r^2_1 * 5 << 14
>>> > > (where r^2_1 is of degree 48) needs 65 bits to fit makes it a little
>>> > > harder to apply.
>>> >
>>> > Hmm. It might work to do a 65-bit constant. To compute r * h, where r is
>>> > 65 bits and h is 48 bits (and possibly one or two excess bits), say
>>> > r_0 = r mod 2^64, r_1 = -(r >> 64) (all ones if and only if r >= 2^64).
>>> >
>>> > Then
>>> >
>>> >   r * h = r_0 * h + (r_1 & h) * 2^64
>>> >
>>> > So first multiply by the low 64 bits, then do a conditional addition
>>> > (via an AND operation) into the high half. If working with regular
>>> > 64-bit registers, it's not too bad, but I guess it may get rather messy
>>> > with vector registers. So perhaps not the first thing to try out.
>>> >
>>> > Regards,
>>> > /Niels
>>> >
>>> > --
>>> > Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
>>> > Internet email is subject to wholesale government surveillance.
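For what it's worth, the masked addition Niels describes for a 65-bit
multiplier is short in plain C with 64-bit limbs. The sketch below assumes
the constant is passed as its low 64 bits plus a mask that is all ones
exactly when bit 64 is set (i.e. r_1 above); the function and parameter
names are illustrative only, not from nettle:

  #include <stdint.h>

  /* Compute r * h for a 65-bit constant r and a limb h of roughly 48 bits.
     r_lo is r mod 2^64; r_hi_mask is all ones iff r >= 2^64.  */
  static inline unsigned __int128
  mul_r65 (uint64_t r_lo, uint64_t r_hi_mask, uint64_t h)
  {
    /* Low part: ordinary 64x64 -> 128-bit multiplication.  */
    unsigned __int128 p = (unsigned __int128) r_lo * h;
    /* High part: if bit 64 of r is set, add h << 64; the AND stands in
       for a conditional branch.  */
    p += (unsigned __int128) (r_hi_mask & h) << 64;
    return p;
  }

Since h has at most about 50 significant bits, the product stays well below
2^128, so the result fits in a single unsigned __int128 accumulator.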