"Yuriy M. Kaminskiy" <[email protected]> writes:

> On raspberry pi 3b+ (cortex-a53 @ 1.4GHz):
> Before:
>  aes128         |  nanosecs/byte   mebibytes/sec   cycles/byte
>         ECB enc |     39.58 ns/B     24.10 MiB/s         - c/B
>         ECB dec |     39.57 ns/B     24.10 MiB/s         - c/B
> After:
>         ECB enc |     15.24 ns/B     62.57 MiB/s         - c/B
>         ECB dec |     15.68 ns/B     60.80 MiB/s         - c/B
>
> Passes nettle regression test (only little-endian though)

Cool!

> Does not use pre-rotated tables (as in AES_SMALL), so reduces d-cache
> footprint from 4.25K to 1K (enc)/1.25K (dec);

We could figure out a way to exclude the unneeded tables from builds
that unconditionally use this code.

I think I tried the rotation approach years ago and found it slower;
that's recorded in this comment:

  C It's tempting to use eor with rotation, but that's slower.

But things may have changed, or you may be doing it in a different way
than I tried. Have you benchmarked with both small and large tables?
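For concreteness, the trick I'm referring to corresponds to roughly this
C sketch (illustrative only, not the actual nettle code; T0 is a
stand-in for the real combined S-box/MixColumns table, and the rotation
directions depend on table layout and endianness):

```c
#include <stdint.h>

/* Rotate left; n must be in 1..31. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
  return (x << n) | (x >> (32 - n));
}

/* One output column computed from a single 1 KB table, with the three
   "missing" tables derived by rotating the looked-up words. */
static uint32_t
column(const uint32_t T0[256], uint32_t w)
{
  return T0[w & 0xff]
    ^ rotl32(T0[(w >> 8)  & 0xff], 8)
    ^ rotl32(T0[(w >> 16) & 0xff], 16)
    ^ rotl32(T0[(w >> 24) & 0xff], 24);
}
```

The pre-rotated variant instead keeps four 1 KB tables and does four
plain loads, trading roughly 3 KB of d-cache footprint for the rotates.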

> completely unrolled, so increases i-cache footprint
> from 948b to 4416b (enc)/4032b (dec)

Do you have any numbers for the performance gain from unrolling?

With complete unrolling, it may be worthwhile to have separate entry
points (and possibly separate files) for aes128, aes192 and aes256. I've
been considering doing that for the x86_64/aesni code (and then the
old-style aes_encrypt needs to be changed to not use the _aes_encrypt
function with a rounds argument; I have a branch doing that lying around
somewhere).
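What I have in mind is roughly the following structure (a hypothetical C
sketch; the round body is a dummy stand-in and the names are not from
nettle): a shared inline body where the round count is a compile-time
constant in each caller, so each entry point can be fully unrolled.

```c
#include <stdint.h>

/* Dummy stand-in for the real round function; the point is only the
   structure: `rounds` is a compile-time constant at each call site,
   so the loop can be unrolled per key size. */
static inline void
cipher_body(uint32_t s[4], const uint32_t *rk, unsigned rounds)
{
  for (unsigned i = 0; i < rounds; i++)
    for (unsigned j = 0; j < 4; j++)
      s[j] ^= rk[4*i + j];
}

void aes128_block(uint32_t s[4], const uint32_t *rk) { cipher_body(s, rk, 10); }
void aes256_block(uint32_t s[4], const uint32_t *rk) { cipher_body(s, rk, 14); }
```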

> P.S. Yes, I tried convert macros to m4: complete failure (no named
> parameters, problems with more than 9 arguments, weird expansion rules);
> so I fallen back to good ol' gas. Sorry.

The lack of named arguments may be a bit annoying. At least for the AES
code, I see no macros with more than 9 arguments.

> define(<KEYSCHEDULE_REVERSED>,<yes>)
> define(<IF_KEYSCHEDULE_REVERSED>,<ifelse(
> KEYSCHEDULE_REVERSED,yes,<$1>,
> KEYSCHEDULE_REVERSED,no,<$2>)>)

What is this for?

> C helper macros
> .macro ldr_unaligned_le rout rsrc offs rtmp
>       ldrb \rout, [\rsrc, #((\offs) + 0)]
>       ldrb \rtmp, [\rsrc, #((\offs) + 1)]
>       orr \rout, \rout, \rtmp, lsl #8
>       ldrb \rtmp, [\rsrc, #((\offs) + 2)]
>       orr \rout, \rout, \rtmp, lsl #16
>       ldrb \rtmp, [\rsrc, #((\offs) + 3)]
>       orr \rout, \rout, \rtmp, lsl #24
> .endm

A different way to read unaligned data is to read aligned words, and
rotate and shift on the fly. There's an example of this in
arm/v6/sha256-compress.asm, using ldm, sel and ror, + some setup code
and one extra register for keeping left-over bytes.
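In C, that technique looks roughly like this (a little-endian sketch
with made-up names, not the actual sha256-compress code; note the
deliberate aligned over-read of the final word, which is what the asm
version relies on):

```c
#include <stdint.h>
#include <string.h>

/* Load n 32-bit words from a possibly unaligned src using only aligned
   word loads, combining adjacent words with shifts. Little-endian.
   May read up to 3 bytes past src + 4*n, but only within the aligned
   word that holds the last data byte, so the loads cannot fault. */
static void
load_words_unaligned(uint32_t *dst, const unsigned char *src, unsigned n)
{
  uintptr_t addr = (uintptr_t) src;
  unsigned shift = (addr & 3) * 8;
  const uint32_t *p = (const uint32_t *) (addr & ~(uintptr_t) 3);

  if (shift == 0)
    {
      memcpy (dst, src, 4 * n);   /* already aligned */
      return;
    }
  uint32_t prev = p[0];           /* left-over bytes carried between iterations */
  for (unsigned i = 0; i < n; i++)
    {
      uint32_t next = p[i + 1];
      dst[i] = (prev >> shift) | (next << (32 - shift));
      prev = next;
    }
}
```

The asm version keeps the left-over word in a register across the whole
block, so the extra load per word disappears.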

> PROLOGUE(_nettle_aes_decrypt)
>       .cfi_startproc
>       teq     PARAM_LENGTH, #0
>       bxeq    lr
>
>       push {r0,r3,%r4-%r11, %ip, %lr}
>       .cfi_adjust_cfa_offset 48
>       .cfi_rel_offset r0, 0   C PARAM_LENGTH
>       .cfi_rel_offset r3, 4   C PARAM_ROUNDS
>       .cfi_rel_offset r4, 8
>       .cfi_rel_offset r5, 12
>       .cfi_rel_offset r6, 16
>       .cfi_rel_offset r7, 20
>       .cfi_rel_offset r8, 24
>       .cfi_rel_offset r9, 28
>       .cfi_rel_offset r10, 32
>       .cfi_rel_offset r11, 36
>       .cfi_rel_offset ip, 40
>       .cfi_rel_offset lr, 44

Are these .cfi_* pseudo-ops essential? I'm afraid I'm ignorant of the
fine details here; I just see from the gas manual that they appear to be
related to stack unwinding.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
_______________________________________________
nettle-bugs mailing list
[email protected]
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
