Maamoun TK writes:

> This is great information that I can keep in mind for future
> implementations. The s390x architecture offers the 'xc' instruction
> ("Storage-to-storage XOR"), with a maximum length of 256 bytes, but we
> can do as many iterations as we need. I optimized memxor using that
> instruction, as it achieves optimal performance for this case; I'll
> attach the patch at the end of this message.

Nice! I'd like to merge this as soon as the s390x CI is up and running
again.
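
For readers following along, here is a rough C sketch of the looping
structure described above. It is not the attached patch: xor_block is a
hypothetical stand-in for a single 'xc' operation (plain C, at most 256
bytes), and memxor_chunked is an illustrative name, not the Nettle
function.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed upper bound for one 'xc' operation, as quoted in the thread. */
enum { XC_MAX = 256 };

/* Hypothetical stand-in for one 'xc' instruction: dst ^= src for
   n bytes, where 1 <= n <= XC_MAX. */
static void
xor_block(uint8_t *dst, const uint8_t *src, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] ^= src[i];
}

/* memxor-like operation: dst ^= src for n bytes. The areas are assumed
   not to overlap. Full 256-byte chunks first, then the remainder. */
void *
memxor_chunked(void *dst_in, const void *src_in, size_t n)
{
  uint8_t *dst = dst_in;
  const uint8_t *src = src_in;

  while (n >= XC_MAX)
    {
      xor_block(dst, src, XC_MAX);
      dst += XC_MAX;
      src += XC_MAX;
      n -= XC_MAX;
    }
  if (n > 0)
    xor_block(dst, src, n);

  return dst_in;
}
```

In real assembly the remainder would presumably need its own handling,
since the length of 'xc' is encoded in the instruction; the sketch
ignores that detail.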

> Unfortunately, I couldn't manage to optimize memxor3 using the 'xc'
> instruction because, while it supports overlapping operands, it
> processes them from left to right, one byte at a time.

Hmm, I wonder if there's some way to work around that.
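
To illustrate why the processing direction matters, here is a small C
sketch. It assumes that memxor3 is expected to tolerate a destination
that overlaps a source at a higher address; that is my assumption for
the example, not something stated in this thread. The function names
xor3_forward and xor3_backward are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* dst[i] = a[i] ^ b[i], lowest address first. This models the
   left-to-right, byte-at-a-time order described above. */
static void
xor3_forward(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] = a[i] ^ b[i];
}

/* Same operation, highest address first. When dst lies above a and
   overlaps it, each source byte is read before it is overwritten. */
static void
xor3_backward(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
  for (size_t i = n; i > 0; i--)
    dst[i - 1] = a[i - 1] ^ b[i - 1];
}
```

With a six-byte buffer {1, 2, 3, 4, 5, 6}, a pointing at its start, dst
two bytes further in, n = 4 and b all zero, the backward loop leaves
1, 2, 3, 4 at dst, while the forward loop leaves 1, 2, 1, 2 because it
rereads bytes it has already written.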

> However, I think optimizing just memxor could give a good sense of how
> much it would increase the performance of the AES modes. CBC mode comes
> in handy here since it uses memxor in the encrypt operation, and in the
> decrypt operation when its operands don't overlap. Here is the
> benchmark result of CBC mode:
>
> *-----------------------------------------------------------------------*
> |                                   | AES-128 Encrypt | AES-128 Decrypt |
> |-----------------------------------|-----------------|-----------------|
> | CBC-Accelerator                   |        1.18 cbp |        0.75 cbp |
> | Basic AES-Accelerator             |       13.50 cbp |        3.34 cbp |
> | Basic AES-Accelerator with memxor |       15.50     |        1.57     |
> *-----------------------------------------------------------------------*

This seems to confirm that CBC encrypt is the operation that gains the
most from assembly for the combined operation. But AES decrypt can also
gain a factor of two in performance; does that mean that both AES-CBC
and memxor run at a speed limited by memory bandwidth, and that the gain
comes from one less pass loading and storing data from memory?
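
To make the "one less pass" question concrete, here is a C sketch of the
two structures I have in mind. toy_decrypt_block is a hypothetical
placeholder, not AES, and neither function is the Nettle or the s390x
code; they only show a separate XOR pass versus a fused loop.

```c
#include <stddef.h>
#include <stdint.h>

enum { BLOCK = 16 };

/* Hypothetical toy "block decrypt", NOT AES. It only stands in for the
   accelerated block operation so the sketch is self-contained. */
static void
toy_decrypt_block(uint8_t *dst, const uint8_t *src)
{
  for (size_t i = 0; i < BLOCK; i++)
    dst[i] = (uint8_t)((src[i] ^ 0x5a) + i);
}

/* Two passes over the data: decrypt all blocks, then XOR in the IV and
   the previous ciphertext blocks (the memxor step). dst and src must
   not overlap; len must be a multiple of BLOCK. */
static void
cbc_decrypt_two_pass(uint8_t *dst, const uint8_t *src,
                     const uint8_t *iv, size_t len)
{
  if (len == 0)
    return;
  for (size_t off = 0; off < len; off += BLOCK)
    toy_decrypt_block(dst + off, src + off);
  for (size_t i = 0; i < BLOCK; i++)
    dst[i] ^= iv[i];
  for (size_t i = BLOCK; i < len; i++)
    dst[i] ^= src[i - BLOCK];
}

/* One pass: each block is decrypted and XORed with the previous
   ciphertext block before moving on. Same preconditions as above. */
static void
cbc_decrypt_fused(uint8_t *dst, const uint8_t *src,
                  const uint8_t *iv, size_t len)
{
  const uint8_t *prev = iv;
  for (size_t off = 0; off < len; off += BLOCK)
    {
      toy_decrypt_block(dst + off, src + off);
      for (size_t i = 0; i < BLOCK; i++)
        dst[off + i] ^= prev[i];
      prev = src + off;
    }
}
```

Both functions produce the same output; the difference is only that the
first one touches dst a second time after the decrypt pass.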

What unit is "cbp"? If it's cycles per byte, 0.77 cycles/byte for memxor
(the cost of "Basic AES-Accelerator with memxor" minus the cost of
"CBC-Accelerator") sounds unexpectedly slow compared to, e.g., x86_64,
where I get 0.08 cycles per byte (regardless of alignment), or 0.64
cycles per 64-bit word.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.