Hi Niels,

Here is version 2 of the AES/GCM stitched patch. The stitched code is written entirely in assembly and uses m4 macros. Overall performance improved by roughly 110% for encrypt and 120% for decrypt, respectively. Please see the attached patch and AES benchmark.

Thanks.
-Danny

> On Nov 22, 2023, at 2:27 AM, Niels Möller <[email protected]> wrote:
>
> Danny Tsen <[email protected]> writes:
>
>> Interleaving at the instructions level may be a good option but due to
>> PPC instruction pipeline this may need to have sufficient
>> registers/vectors. Use same vectors to change contents in successive
>> instructions may require more cycles. In that case, more
>> vectors/scalar will get involved and all vectors assignment may have
>> to change. That’s the reason I avoided in this case.
>
> To investigate the potential, I would suggest some experiments with
> software pipelining.
>
> Write a loop to do 4 blocks of ctr-aes128 at a time, fully unrolling the
> round loop. I think that should be 44 instructions of aes mangling, plus
> instructions to setup the counter input, and do the final xor and
> endianness things with the message. Arrange so that it loads the AES
> state in a set of registers we can call A, operating in-place on these
> registers. But at the end, arrange the XORing so that the final
> cryptotext is located in a different set of registers, B.
>
> Then, write the instructions to do ghash using the B registers as input,
> I think that should be about 20-25 instructions. Interleave those as
> well as possible with the AES instructions (say, two aes instructions,
> one ghash instruction, etc).
>
> Software pipelining means that each iteration of the loop does aes-ctr
> on four blocks, + ghash on the output for the four *previous* blocks (so
> one needs extra code outside of the loop to deal with first and last 4
> blocks). Decrypt processing should be simpler.
>
> Then you can benchmark that loop in isolation. It doesn't need to be the
> complete function, the handling of first and last blocks can be omitted,
> and it doesn't even have to be completely correct, as long as it's the
> right instruction mix and the right data dependencies. The benchmark
> should give a good idea for the potential speedup, if any, from
> instruction-level interleaving.
>
> I would hope 4-way is doable with available vector registers (and this
> inner loop should be less than 100 instructions, so not too
> unmanageable). Going up to 8-way (like the current AES code) would also
> be interesting, but as you say, you might have a shortage of registers.
> If you have to copy state between registers and memory in each iteration
> of an 8-way loop (which it looks like you also have to do in your
> current patch), that overhead cost may outweight the gains you have from
> more independence in the AES rounds.
>
> Regards,
> /Niels
>
> --
> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> Internet email is subject to wholesale government surveillance.
_______________________________________________
nettle-bugs mailing list -- [email protected]
To unsubscribe send an email to [email protected]
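
For readers following the thread, the loop shape Niels describes (encrypt the current four blocks in register set A while hashing the previous four from register set B) can be sketched in plain C. This is only an illustration of the control flow and data dependencies, not the patch's code: `toy_keystream` and `toy_hash` are hypothetical stand-ins that are neither AES nor GHASH, blocks are shrunk to 64 bits, and the names `WAY`, `encrypt_then_hash` and `pipelined` are made up for the example. The 44-instruction figure in the quoted mail is presumably 4 blocks times 11 round-key steps for AES-128 (initial xor, 9 middle rounds, 1 last round).

```c
#include <stddef.h>
#include <stdint.h>

#define WAY 4 /* blocks per loop iteration, as in the 4-way suggestion */

/* Hypothetical stand-in for one CTR keystream block. NOT real AES. */
static uint64_t toy_keystream(uint64_t key, uint64_t ctr)
{
    uint64_t x = key ^ (ctr * 0x9E3779B97F4A7C15ULL);
    x ^= x >> 29;
    return x * 0xBF58476D1CE4E5B9ULL;
}

/* Hypothetical stand-in for one hash update. NOT real GHASH. */
static uint64_t toy_hash(uint64_t h, uint64_t c)
{
    return (h ^ c) * 0x100000001B3ULL;
}

/* Reference: encrypt a block, then hash it right away. */
static uint64_t encrypt_then_hash(uint64_t key, size_t n,
                                  uint64_t *dst, const uint64_t *src)
{
    uint64_t h = 0;
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i] ^ toy_keystream(key, i);
        h = toy_hash(h, dst[i]);
    }
    return h;
}

/* Software-pipelined variant; n must be a nonzero multiple of WAY. */
static uint64_t pipelined(uint64_t key, size_t n,
                          uint64_t *dst, const uint64_t *src)
{
    uint64_t a[WAY], b[WAY], h = 0;
    size_t i, j;

    /* Prologue: encrypt the first group; nothing to hash yet. */
    for (j = 0; j < WAY; j++)
        b[j] = dst[j] = src[j] ^ toy_keystream(key, j);

    /* Steady state: cipher work on A is independent of hash work on B,
       so in assembly the two instruction streams could be interleaved. */
    for (i = WAY; i < n; i += WAY) {
        for (j = 0; j < WAY; j++) {
            a[j] = toy_keystream(key, i + j); /* current group, in A */
            h = toy_hash(h, b[j]);            /* previous group, from B */
        }
        /* Final XOR places the new ciphertext in B for the next pass. */
        for (j = 0; j < WAY; j++)
            b[j] = dst[i + j] = a[j] ^ src[i + j];
    }

    /* Epilogue: hash the last group still waiting in B. */
    for (j = 0; j < WAY; j++)
        h = toy_hash(h, b[j]);
    return h;
}
```

Calling both functions on the same input should give identical ciphertext and hash values, since the pipelined version only reorders when the hash update happens relative to the cipher work. As suggested in the quoted mail, a benchmark of the steady-state loop alone could skip the prologue and epilogue entirely.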
