I've spent an evening taking a closer look at Intel AES performance,
after it was pointed out to me by Torbjörn that the Intel AESNI
instructions are highly pipelined on most or all processors supporting
them. This means that the processor can start execution of one or even
two instructions per cycle, provided that they are independent, while
it takes several cycles until the results become available to
dependent instructions.
Out-of-order (OoO) execution can help to run things in parallel, even
if the instruction stream locally is a sequence of dependent
instructions. E.g., my main development machine has a Broadwell
processor, which can issue one aesni instruction per cycle, and I
think the latency is 7 cycles. Encrypting one block needs 10 aesni
instructions (one per round, and then there's an eleventh subkey
applied with a plain xor, which can issue in parallel with another
instruction in the same cycle).
When I benchmark, aes128 ECB runs at 10.2 cycles per block, which
means that out-of-order execution is very successful, executing many
iterations of the loop in parallel. CBC encrypt, which is inherently
non-parallel, runs *much* slower: I get 91 cycles per block, where the
latency of just the aes encryption would be 71 cycles (if 7 cycles of
latency per aesni instruction is correct, plus one more cycle for the
xor of the subkey), leaving the remaining 20 cycles for the CBC
processing, which seems a bit slow in itself.
I'm considering rearranging the loops to interleave multiple blocks.
See the code below, which implements 4-way interleaving (a drop-in
replacement for x86_64/aesni/aes-encrypt-internal). If this approach
turns out to be useful, it might be beneficial to extend it to 8-way
interleaving: that would allow instructions to run nicely in parallel
even without any out-of-order execution.
However, the interleaved code makes no difference to performance in my
benchmarks on my machine. That is not too surprising: the old code is
already close to 10 cycles per block, which is a hard limit from the
issue rate of the aesni instructions, and it seems almost all other
instructions are already executing in parallel with them.
But it might be an improvement on other processors, or for
applications that process a small number of blocks at a time
(somewhere between CBC, which always does one block at a time, and the
ECB benchmark, which does 10 KB, or 640 blocks, at a time), by making
things easier for the processor's OoO machinery.
If you have any application benchmarks, it would be interesting if you
could try out the interleaved version. I'm also thinking that maybe we
should add benchmarks for various message sizes?
Regards,
/Niels
C x86_64/aesni/aes-encrypt-internal.asm
ifelse(<
Copyright (C) 2015, 2018 Niels Möller
This file is part of GNU Nettle.
GNU Nettle is free software: you can redistribute it and/or
modify it under the terms of either:
* the GNU Lesser General Public License as published by the Free
Software Foundation; either version 3 of the License, or (at your
option) any later version.
or
* the GNU General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your
option) any later version.
or both in parallel, as here.
GNU Nettle is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.
You should have received copies of the GNU General Public License and
the GNU Lesser General Public License along with this program. If
not, see http://www.gnu.org/licenses/.
>)
C Input argument
define(<ROUNDS>, <%rdi>)
define(<KEYS>, <%rsi>)
C define(<TABLE>, <%rdx>) C Unused here
define(<LENGTH>,<%rcx>)
define(<DST>, <%r8>)
define(<SRC>, <%r9>)
define(<CNT>, <%r10>)
define(<TMP>, <%r10>) C Can overlap CNT
define(<TAB>, <%rdx>)
define(<SUBKEY>, <%xmm0>)
define(<B0>, <%xmm1>)
define(<B1>, <%xmm2>)
define(<B2>, <%xmm3>)
define(<B3>, <%xmm4>)
.file "aes-encrypt-internal.asm"
C _aes_encrypt(unsigned rounds, const uint32_t *keys,
C const struct aes_table *T,
C size_t length, uint8_t *dst,
C uint8_t *src)
.text
ALIGN(16)
PROLOGUE(_nettle_aes_encrypt)
W64_ENTRY(6, 5)
shr $4, LENGTH
test LENGTH, LENGTH
jz .Lend
C Each round uses one subkey of 16 bytes. We have
C an initial xor round, rounds-1 regular rounds, and one final
C round. Adjust so that ROUNDS reflects the number of regular
C rounds, KEYS points to the key for the final round, and KEYS
C + ROUNDS points to the key for the first regular round.
dec XREG(ROUNDS)
shl $4, XREG(ROUNDS)
lea 16(ROUNDS, KEYS), KEYS
neg ROUNDS
jmp .Loop_end
.Lloop_4w:
mov ROUNDS, CNT
movups -16(KEYS, ROUNDS), SUBKEY
movups (SRC), B0
movups 16(SRC), B1
movups 32(SRC), B2
movups 48(SRC), B3
pxor SUBKEY, B0
pxor SUBKEY, B1
pxor SUBKEY, B2
pxor SUBKEY, B3
.Lround_loop_4w:
movups (KEYS, CNT), SUBKEY
add $16, CNT
aesenc SUBKEY, B0
aesenc SUBKEY, B1
aesenc SUBKEY, B2
aesenc SUBKEY, B3
jne .Lround_loop_4w
movups (KEYS), SUBKEY
aesenclast SUBKEY, B0
aesenclast SUBKEY, B1
aesenclast SUBKEY, B2
aesenclast SUBKEY, B3
movups B0, (DST)
movups B1, 16(DST)
movups B2, 32(DST)
movups B3, 48(DST)
add $64, SRC
add $64, DST
sub $4, LENGTH
.Loop_end:
cmp $4, LENGTH
jnc .Lloop_4w
lea .Ljmptab(%rip), TAB
movslq (TAB, LENGTH, 4), TMP
lea (TAB, TMP), TMP
jmp *TMP
.Ltail3:
mov ROUNDS, CNT
movups -16(KEYS, ROUNDS), SUBKEY
movups (SRC), B0
movups 16(SRC), B1
movups 32(SRC), B2
pxor SUBKEY, B0
pxor SUBKEY, B1
pxor SUBKEY, B2
.Lround_loop_3w:
movups (KEYS, CNT), SUBKEY
add $16, CNT
aesenc SUBKEY, B0
aesenc SUBKEY, B1
aesenc SUBKEY, B2
jne .Lround_loop_3w
movups (KEYS), SUBKEY
aesenclast SUBKEY, B0
aesenclast SUBKEY, B1
aesenclast SUBKEY, B2
movups B0, (DST)
movups B1, 16(DST)
movups B2, 32(DST)
jmp .Lend
.Ltail2:
mov ROUNDS, CNT
movups -16(KEYS, ROUNDS), SUBKEY
movups (SRC), B0
movups 16(SRC), B1
pxor SUBKEY, B0
pxor SUBKEY, B1
.Lround_loop_2w:
movups (KEYS, CNT), SUBKEY
add $16, CNT
aesenc SUBKEY, B0
aesenc SUBKEY, B1
jne .Lround_loop_2w
movups (KEYS), SUBKEY
aesenclast SUBKEY, B0
aesenclast SUBKEY, B1
movups B0, (DST)
movups B1, 16(DST)
jmp .Lend
.Ltail1:
mov ROUNDS, CNT
movups -16(KEYS, ROUNDS), SUBKEY
movups (SRC), B0
pxor SUBKEY, B0
.Lround_loop_1w:
movups (KEYS, CNT), SUBKEY
add $16, CNT
aesenc SUBKEY, B0
jne .Lround_loop_1w
movups (KEYS), SUBKEY
aesenclast SUBKEY, B0
movups B0, (DST)
.Lend:
W64_EXIT(6, 5)
ret
EPILOGUE(_nettle_aes_encrypt)
C FIXME: Put in rodata?
ALIGN(4)
.Ljmptab:
.long .Lend - .Ljmptab
.long .Ltail1 - .Ljmptab
.long .Ltail2 - .Ljmptab
.long .Ltail3 - .Ljmptab
--
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
_______________________________________________
nettle-bugs mailing list
[email protected]
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs