On 17.03.2019 11:08, Niels Möller wrote:
> "Yuriy M. Kaminskiy" <[email protected]> writes:
>
>> On raspberry pi 3b+ (cortex-a53 @ 1.4GHz):
>> Before:
>> aes128 | nanosecs/byte mebibytes/sec cycles/byte
>> ECB enc | 39.58 ns/B 24.10 MiB/s - c/B
>> ECB dec | 39.57 ns/B 24.10 MiB/s - c/B
>> After:
>> ECB enc | 15.24 ns/B 62.57 MiB/s - c/B
>> ECB dec | 15.68 ns/B 60.80 MiB/s - c/B
>>
>> Passes the nettle regression tests (only tested little-endian, though)
>
> Cool!
>
>> Does not use pre-rotated tables (as in AES_SMALL), so reduces d-cache
>> footprint from 4.25K to 1K (enc)/1.25K (dec);
>
> We could figure out a way to exclude unneeded tables in builds that
> unconditionally use this code.
>
> I think I tried this years ago, and found it slower, recorded in this
> comment:
>
> C It's tempting to use eor with rotation, but that's slower.
>
> But things may have changed, or you're doing it in a different way than
> I tried. Have you benchmarked with small and large tables?
Well, it is not my code. I just took it from libgcrypt and adapted it to
nettle (reordered arguments, added a loop over blocks, changed the decrypt
key schedule to nettle's), without any major changes.
>> completely unrolled, so increases i-cache footprint
>> from 948b to 4416b (enc)/4032b (dec)
>
> Do you have any numbers for the performance gain from unrolling?
>
> With complete unrolling, it may be good to have separate entry points (and
> possibly separate files) for aes128, aes192, aes256.
As it is now, it at least reuses some code: if an application calls both
aes128 and aes256, some i-cache is saved.
> I've been
> considering doing that for the x86_64/aesni (and then the old-style
> aes_encrypt needs to be changed to not use the _aes_encrypt function
> with a rounds argument; I have a branch doing that lying around
> somewhere).
>
>> P.S. Yes, I tried converting the macros to m4: complete failure (no named
>> parameters, problems with more than 9 arguments, weird expansion rules);
>> so I fell back to good ol' gas. Sorry.
>
> No named arguments may be a bit annoying. At least for the AES code, I
> see no macros with more than 9 arguments.
Some have 10:
.macro do_encround next_r ra rb rc rd rna rnb rnc rnd preload_key
.macro encround round ra rb rc rd rna rnb rnc rnd preload_key
(And I failed to grok m4's way of making indirect macro calls.)
>> define(<KEYSCHEDULE_REVERSED>,<yes>)
>> define(<IF_KEYSCHEDULE_REVERSED>,<ifelse(
>> KEYSCHEDULE_REVERSED,yes,<$1>,
>> KEYSCHEDULE_REVERSED,no,<$2>)>)
>
> What is this for?
See the FIXME comment in aes-invert-internal.c; the original gcrypt code uses
an unswapped key schedule and walks backwards in aes_decrypt. For the nettle
port, I had to change that, but I left the original code in as an option
(should nettle someday decide to follow the FIXME and switch).
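A minimal C sketch of the difference, with hypothetical helper names (the
actual round computation is elided):

  #include <stdint.h>

  /* nettle: _aes_invert reverses the schedule up front,
     so decryption consumes subkeys front to back. */
  static void
  walk_forward(const uint32_t *keys, unsigned rounds)
  {
    for (unsigned i = 0; i < rounds; i++) {
      const uint32_t *subkey = keys + 4*i;
      (void) subkey; /* ... one decrypt round with subkey ... */
    }
  }

  /* gcrypt: schedule kept in encryption order,
     so decryption consumes subkeys back to front. */
  static void
  walk_backward(const uint32_t *keys, unsigned rounds)
  {
    for (unsigned i = rounds; i-- > 0;) {
      const uint32_t *subkey = keys + 4*i;
      (void) subkey; /* ... one decrypt round with subkey ... */
    }
  }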
BTW, the mtable in aes-invert-internal.c is exactly the same as
_aes_decrypt_table.table[0]; it would be good to merge them.
(Another trick, used in the last round of gcrypt's aes-encrypt: the plain
sbox value appears at two byte positions of each main-table entry,
  _aes_encrypt_table.sbox[i] == (_aes_encrypt_table.table[0][i] >> 8) & 0xff
  _aes_encrypt_table.sbox[i] == (_aes_encrypt_table.table[0][i] >> 16) & 0xff
so the last round does not need to touch sbox at all, saving 256 bytes of
d-cache footprint. The same trick does not work for aes-decrypt.)
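A minimal C check of that identity, assuming the table layout from nettle's
aes-internal.h (table[0][i] packs {02*S[i], S[i], S[i], 03*S[i]} as a
little-endian word, so bytes 1 and 2 are the plain sbox value):

  #include <assert.h>
  #include <stdint.h>

  struct aes_table {          /* layout as in aes-internal.h */
    uint8_t sbox[0x100];
    uint32_t table[4][0x100];
  };
  extern const struct aes_table _aes_encrypt_table;

  static void
  check_sbox_in_table(void)
  {
    for (unsigned i = 0; i < 0x100; i++) {
      uint32_t t = _aes_encrypt_table.table[0][i];
      assert(_aes_encrypt_table.sbox[i] == ((t >> 8) & 0xff));
      assert(_aes_encrypt_table.sbox[i] == ((t >> 16) & 0xff));
    }
  }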
>> C helper macros
>> .macro ldr_unaligned_le rout rsrc offs rtmp
>> ldrb \rout, [\rsrc, #((\offs) + 0)]
>> ldrb \rtmp, [\rsrc, #((\offs) + 1)]
>> orr \rout, \rout, \rtmp, lsl #8
>> ldrb \rtmp, [\rsrc, #((\offs) + 2)]
>> orr \rout, \rout, \rtmp, lsl #16
>> ldrb \rtmp, [\rsrc, #((\offs) + 3)]
>> orr \rout, \rout, \rtmp, lsl #24
>> .endm
>
> A different way to read unaligned data is to read aligned words, and
> rotate and shift on the fly. There's an example of this in
> arm/v6/sha256-compress.asm, using ldm, sel and ror, + some setup code
> and one extra register for keeping left-over bytes.
Actually, this is unused in the latest version of the code: armv6 supports
unaligned ldr, so I replaced
  if (aligned) { ldm; IF_BE(rev); } else { ldr_unaligned_le }
with a simple unconditional
  ldr; IF_BE(rev);
It was a bit faster with misaligned buffers, and almost the same speed with
aligned buffers. I've left the macro in, in case someone wants to use this
code on armv5 (or the unconditional ldr turns out slower on some other CPU).
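For reference, what ldr_unaligned_le computes, written as a portable C
sketch (mirroring the ldrb/orr sequence of the macro above):

  #include <stdint.h>

  /* Assemble a 32-bit little-endian word from four single-byte loads,
     safe at any alignment. */
  static uint32_t
  le32_load_unaligned(const uint8_t *src)
  {
    return (uint32_t) src[0]
         | ((uint32_t) src[1] << 8)
         | ((uint32_t) src[2] << 16)
         | ((uint32_t) src[3] << 24);
  }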
>> PROLOGUE(_nettle_aes_decrypt)
>> .cfi_startproc
>> teq PARAM_LENGTH, #0
>> bxeq lr
>>
>> push {r0,r3,%r4-%r11, %ip, %lr}
>> .cfi_adjust_cfa_offset 48
>> .cfi_rel_offset r0, 0 C PARAM_LENGTH
>> .cfi_rel_offset r3, 4 C PARAM_ROUNDS
>> .cfi_rel_offset r4, 8
...
>
> Are these .cfi_* pseudoops essential? I'm afraid I'm ignorant of the
> fine details here; I just see from the gas manual that they appear to be
> related to stack unwinding.
They are useful for gdb, valgrind, etc. to produce a sensible backtrace and
to move up/down the call chain (without losing the values of callee-saved
registers, etc.), and AFAIK they add no runtime overhead, so I add them
whenever possible (FWIW, they were not present in the original gcrypt code).
P.S. There was a stupid last-minute error in the posted aes-encrypt-internal.asm
(str where ldr was meant when loading the source block); patch attached.
--- /home/yukam/Desktop/aes-encrypt-internal.asm 2019-03-16 13:13:58.000000000 +0300
+++ aes-encrypt-internal.asm 2019-03-17 11:45:04.110701283 +0300
@@ -263,16 +263,16 @@
ldr RT0, FRAME_SRC
ifelse(V6,V6,<
+ ldr RA, [RT0]
+ ldr RB, [RT0, #4]
+ ldr RC, [RT0, #8]
+ ldr RD, [RT0, #12]
IF_BE(<
rev RA, RA
rev RB, RB
rev RC, RC
rev RD, RD
>)
- str RA, [RT0]
- str RB, [RT0, #4]
- str RC, [RT0, #8]
- str RD, [RT0, #12]
>,<
IF_LE(<
C test if src is unaligned